# **Exploratory Data Analysis in Python using pandas**

Chanin Nantasenamat

<i>[Data Professor YouTube channel](http://youtube.com/dataprofessor), http://youtube.com/dataprofessor </i>

In this Jupyter notebook, I will show you how to perform exploratory data analysis on web-scraped NBA player stats. The data were obtained in a previous [Jupyter notebook](https://github.com/dataprofessor/code/blob/master/python/pandas_read_html_for_webscraping.ipynb), which is discussed in our YouTube video [Easy Web Scraping in Python using Pandas for Data Science](https://www.youtube.com/watch?v=SPu_5EfswIE).

## **Web scraping data using pandas**

The following block of code will retrieve the "2018-19 NBA Player Stats: Per Game" data from http://www.basketball-reference.com/.

In [None]:
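from io import StringIO
from urllib import request
import pandas as pd
import ssl

# Retrieve HTML table data
url = 'https://www.basketball-reference.com/leagues/NBA_2019_per_game.html'
# Note: an unverified SSL context skips certificate checks; use it only if verification fails
context = ssl._create_unverified_context()
response = request.urlopen(url, context=context)
html = response.read().decode('utf-8')
# Parse the HTML that was just downloaded (rather than fetching the URL a second time)
df = pd.read_html(StringIO(html), header=0)
df2019 = df[0]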

# Data cleaning: the table repeats its header row among the data rows; drop those repeats
raw = df2019.drop(df2019[df2019.Age == 'Age'].index)
raw
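
The cleaning step above can be illustrated on a tiny table. This is a minimal sketch, assuming the scraped table repeats its header row among the data rows; the `toy` dataframe and its values are hypothetical and only imitate that layout.

```python
import pandas as pd

# Hypothetical miniature table imitating the scraped layout:
# the header row reappears as a data row partway through.
toy = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player', 'Player C'],
    'Age':    ['25', '31', 'Age', '22'],
    'PTS':    ['10.5', '8.2', 'PTS', '15.1'],
})

# Drop every row whose Age cell holds the literal header text 'Age'.
toy_clean = toy.drop(toy[toy.Age == 'Age'].index)
print(toy_clean)

# The surviving values are still strings, not numbers, at this point.
print(toy_clean.dtypes)
```

Note that the original index labels are kept, so the cleaned table has a gap where the repeated header row used to be.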


## **Acronyms**


Acronym | Description
---|---
Rk | Rank
Pos | Position
Age | Player's age on February 1 of the season
Tm | Team
G | Games
GS | Games Started
MP | Minutes Played Per Game
FG | Field Goals Per Game
FGA | Field Goal Attempts Per Game
FG% | Field Goal Percentage
3P | 3-Point Field Goals Per Game
3PA | 3-Point Field Goal Attempts Per Game
3P% | FG% on 3-Pt FGAs.
2P | 2-Point Field Goals Per Game
2PA | 2-Point Field Goal Attempts Per Game
2P% | FG% on 2-Pt FGAs.
eFG% | Effective Field Goal Percentage
| *(Note: This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.)*
FT | Free Throws Per Game
FTA | Free Throw Attempts Per Game
FT% | Free Throw Percentage
ORB | Offensive Rebounds Per Game
DRB | Defensive Rebounds Per Game
TRB | Total Rebounds Per Game
AST | Assists Per Game
STL | Steals Per Game
BLK | Blocks Per Game
TOV | Turnovers Per Game
PF | Personal Fouls Per Game
PTS | Points Per Game

## **Data cleaning**

### Data dimension

In [None]:
raw.shape

### Dataframe contents

In [None]:
raw.head()

### Check for missing values

In [None]:
raw.isnull().sum()

### Replace missing values with 0 

In [None]:
df = raw.fillna(0)

In [None]:
df.isnull().sum()

In [None]:
df = df.drop(['Rk'], axis=1)
df

### Write to CSV file

In [None]:
df.to_csv('nba2019.csv', index=False)

In [None]:
! ls

In [None]:
! cat nba2019.csv

## **Exploratory Data Analysis**

### Read data

In [None]:
df = pd.read_csv('nba2019.csv')

#### Display the dataframe

In [None]:
df

If we want to see more rows, we can raise the display limit so that it covers the whole dataframe.

In [None]:
pd.set_option('display.max_rows', df.shape[0]+1)

In [None]:
df

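Reverting to a compact display of 10 rows (the pandas default for `display.max_rows` is 60, which `pd.reset_option('display.max_rows')` restores)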

In [None]:
pd.set_option('display.max_rows', 10)

In [None]:
df

### Overview of the data type of each column in the dataframe

In [None]:
df.dtypes
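
The stats columns are numeric here because `pd.read_csv` infers data types when the file is reloaded. As a sketch of an alternative that converts string columns directly, assuming string-valued stats like those in a freshly scraped table, `pd.to_numeric` can be applied column by column. The `toy` dataframe and the `stat_cols` list are hypothetical.

```python
import pandas as pd

# Hypothetical string-valued stats, as they might look straight after scraping.
toy = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player C'],
    'Pos':    ['C', 'PG', 'SF'],
    'G':      ['70', '65', '12'],
    'PTS':    ['10.5', '8.2', None],
})

stat_cols = ['G', 'PTS']
# errors='coerce' turns anything unparseable into NaN instead of raising.
toy[stat_cols] = toy[stat_cols].apply(pd.to_numeric, errors='coerce')
print(toy.dtypes)
```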

### Show specific data types in dataframe

In [None]:
df.select_dtypes(include=['number'])

In [None]:
df.select_dtypes(include=['object'])

## **QUESTIONS**

### **Conditional Selection**

In performing exploratory data analysis, it is important to be able to select subsets of data to perform analysis or comparisons.

**Which player scored the most Points (PTS) Per Game?**
Here, we will return the entire row.

In [None]:
df[df.PTS == df.PTS.max()]
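
The boolean mask above keeps every row that ties for the maximum. As a minimal sketch of an alternative, `idxmax()` returns the index label of the first maximum only. The `toy` dataframe, team codes and values below are hypothetical.

```python
import pandas as pd

# Hypothetical data with a tie for the highest PTS.
toy = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player C'],
    'Tm':     ['AAA', 'BBB', 'CCC'],
    'PTS':    [10.5, 27.3, 27.3],
})

# Boolean mask: keeps every row that ties for the maximum.
all_top = toy[toy.PTS == toy.PTS.max()]

# idxmax: index label of the first maximum, used with .loc to get one row.
first_top = toy.loc[toy.PTS.idxmax()]

print(all_top)
print(first_top['Player'])
```

Which of the two fits better depends on whether ties should be reported.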

We can also return specific column values.

As a follow-up question, what team is the player from?

In [None]:
PlayerMaxPoints = df[df.PTS == df.PTS.max()]
PlayerMaxPoints.Tm

Which position does the player play?

In [None]:
PlayerMaxPoints.Pos

How many games did the player play in the season?

In [None]:
PlayerMaxPoints.G

**Which players scored more than 20 Points (PTS) Per Game?**

In [None]:
df[df.PTS > 20]
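
Several conditions can be combined in one selection. The sketch below assumes a hypothetical extra criterion of at least 50 games played; each condition needs its own parentheses and the element-wise `&` operator joins them. The `toy` data are made up for illustration.

```python
import pandas as pd

# Hypothetical per-game stats.
toy = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player C', 'Player D'],
    'G':      [72, 12, 60, 80],
    'PTS':    [21.0, 25.5, 28.1, 9.4],
})

# More than 20 PTS per game AND at least 50 games (hypothetical threshold).
frequent_scorers = toy[(toy.PTS > 20) & (toy.G >= 50)]

# Sort the matches from highest to lowest PTS and keep two columns.
ranked = frequent_scorers.sort_values('PTS', ascending=False)[['Player', 'PTS']]
print(ranked)
```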

**Which player had the highest 3-Point Field Goals Per Game (3P) ?**

In [None]:
df[df['3P'] == df['3P'].max()]

**Which player had the highest Assists Per Game (AST) ?**

In [None]:
df[df['AST'] == df['AST'].max()]

### **`groupby()` method**

**Which Los Angeles Lakers player scored the most Points (PTS) Per Game?**

In [None]:
LAL = df.groupby('Tm').get_group('LAL')

In [None]:
LAL[LAL.PTS == LAL.PTS.max()]
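
The same question can be asked for every team at once. This is a minimal sketch, not the approach used in this notebook: it assumes a dataframe with `Tm` and `PTS` columns and uses `groupby(...).idxmax()` to find the row label of each team's top scorer. The team codes and values are hypothetical.

```python
import pandas as pd

# Hypothetical players on two made-up teams.
toy = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'Tm':     ['AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'PTS':    [27.4, 18.7, 23.8, 15.7, 13.0],
})

# For each team, the index label of the row with the highest PTS.
top_idx = toy.groupby('Tm')['PTS'].idxmax()

# Look those rows up in the original dataframe.
top_per_team = toy.loc[top_idx]
print(top_per_team)
```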

**Of the 5 positions, which position scores the most points?**

We first group players by their positions.

In [None]:
df.groupby('Pos').PTS.describe()

We will now show only the 5 traditional positions (players listed with combo positions are removed from the analysis).

In [None]:
positions = ['C','PF','SF','PG','SG']
POS = df[ df['Pos'].isin(positions)  ]
POS

Now, let's take a look at the descriptive statistics.

In [None]:
POS.groupby('Pos').PTS.describe()
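
To compare positions directly, the summary can be reduced to a few statistics and sorted. The sketch below assumes that "scores the most" is judged by the mean PTS per player, which is one possible reading of the question; the `toy` values are hypothetical.

```python
import pandas as pd

# Hypothetical per-game scoring for three positions.
toy = pd.DataFrame({
    'Pos': ['C', 'C', 'PG', 'PG', 'SG', 'SG'],
    'PTS': [8.0, 12.0, 10.0, 20.0, 5.0, 7.0],
})

# Count, mean and median PTS per position, highest mean first.
summary = (
    toy.groupby('Pos')['PTS']
       .agg(['count', 'mean', 'median'])
       .sort_values('mean', ascending=False)
)
print(summary)
```

Using the median instead of the mean may change the ranking when a few high scorers dominate a position.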

### **Histograms**

We'll also try to answer this question by showing some histogram plots. So, to make it a bit easier, let's create a subset dataframe.

In [None]:
PTS = df[['Pos','PTS']]

positions = ['C','PF','SF','PG','SG']
PTS = PTS[ PTS['Pos'].isin(positions)  ]

PTS

#### **pandas built-in visualization**

In [None]:
PTS['PTS'].hist(by=PTS['Pos'])

In [None]:
PTS['PTS'].hist(by=PTS['Pos'], layout=(1,5))

In [None]:
PTS['PTS'].hist(by=PTS['Pos'], layout=(1,5), figsize=(16,2))


#### **Seaborn data visualization**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

g = sns.FacetGrid(PTS, col="Pos")
g.map(plt.hist, "PTS");

### **Box plots**

#### **Box plot of points scored (PTS) grouped by Position**

##### **pandas built-in visualization**

In [None]:
PTS.boxplot(column='PTS', by='Pos')

##### **Seaborn data visualization**

In [None]:
import seaborn as sns

sns.boxplot(x = 'Pos', y = 'PTS', data = PTS) 

In [None]:
sns.boxplot(x = 'Pos', y = 'PTS', data = PTS) 
sns.stripplot(x = 'Pos', y = 'PTS', data = PTS,
              jitter=True, 
              marker='o',
              alpha=0.8, 
              color="black")

### **Heat map**

#### Compute the correlation matrix

In [None]:
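# Correlations are only defined for numeric columns, so select those first
corr = df.select_dtypes(include=['number']).corr()
corr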

#### Make the heat map

In [None]:
sns.heatmap(corr)

#### Adjust figure size of heat map

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7,5))
sns.heatmap(corr, square=True, ax=ax)

#### Mask the upper triangle of the heat map (diagonal correlation matrix)

In [None]:
# https://seaborn.pydata.org/generated/seaborn.heatmap.html

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Boolean mask that is True on and above the diagonal, hiding the redundant upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(corr, mask=mask, vmax=1, square=True, ax=ax)
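
To see what the mask contains, it helps to build one for a small correlation matrix and print it. This is a minimal sketch with a hypothetical three-column dataframe; cells where the mask is `True` are the ones the heat map hides.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with three columns.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 6, 9],
    'c': [4, 3, 2, 1],
})
toy_corr = toy.corr()

# True on and above the diagonal, False below it.
toy_mask = np.triu(np.ones(toy_corr.shape, dtype=bool))
print(toy_mask)
```

Because a correlation matrix is symmetric, the lower triangle that stays visible carries all of the pairwise information.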

### **Scatter Plot**

In [None]:
df

#### Select columns if they have numerical data types

In [None]:
df.select_dtypes(include=['number'])

#### Select the first 5 columns (by index number)

In [None]:
number = df.select_dtypes(include=['number'])

In [None]:
number.iloc[:,:5]

#### Select 6 specific columns (by column names)

In [None]:
selections = ['Age', 'G', 'STL', 'BLK', 'AST', 'PTS']
df5 = df[selections]
df5

#### Make scatter plot grid

##### 6 columns

In [None]:
import seaborn as sns

g = sns.PairGrid(df5)
g.map(plt.scatter);

##### All columns

In [None]:
import seaborn as sns

g = sns.PairGrid(number)
g.map(plt.scatter);