In [None]:
import pandas as pd


In [None]:
# Import data from the CSV file to a pandas DataFrame.
player_df = pd.read_csv('player_data.txt')

In [None]:
# Print out the first 20 rows of the player_df DataFrame.
player_df.head(20)

In [None]:
# Show a Boolean mask that marks every NaN value in the DataFrame.
player_df.isna()

In [None]:
# Total up the number of NaN values in each column of the DataFrame.
player_df.isna().sum()

In [None]:
# Print out summary information about the DataFrame.
player_df.info()

Drop columns
To drop columns, you'll use the dropna() method. Like isna(), dropna() looks for NaN values. But it goes a step further and removes either the rows or the columns that contain NaN values. Start by setting some parameters to the method:

By default, dropna() removes rows, so specify that you want to remove columns by using the axis parameter.
By default, the dropna() method returns a new DataFrame. Use the inplace parameter to tell it to drop these columns in the original player_df DataFrame.
You also want dropna() to remove only columns in which all of the values are missing. So set the how parameter to 'all'.
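
As a minimal sketch of how these parameters interact, the following example uses a small made-up DataFrame (toy_df and its columns are hypothetical, not part of the player data). It leaves out inplace so that you can compare the original with the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "c" is entirely NaN, column "b" only partly.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, np.nan, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

# axis="columns" targets columns; how="all" drops only fully empty ones.
cleaned = toy_df.dropna(axis="columns", how="all")

print(list(toy_df.columns))   # the original is unchanged without inplace=True
print(list(cleaned.columns))  # "c" is gone, "b" survives despite one NaN
```

Column "b" is kept because how='all' requires every value in the column to be missing before the column is dropped.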

In [None]:
# Drop columns that have no values.
# Setting axis='columns' makes dropna() remove columns instead of rows.
player_df.dropna(axis='columns', inplace=True, how='all')
player_df.isna().sum()

In [None]:
player_df.info()

Drop rows
You should also address the possibility that an entire row is missing values. That is, one of the players in the DataFrame might have no stats. You can try to address this missing row as you did last time, by using dropna(). But this time, use the method's default row-based behavior.

In [None]:
# Drop rows that have no values.
player_df.dropna(inplace=True, how='all')
player_df.isna().sum()

In [None]:
# Show the entire DataFrame.
player_df

Now you see that three rows are missing the same three values. Another row is missing 10 values. This information indicates two things:

The dataset likely comes from two datasets that were joined together, and both of these earlier datasets had missing rows.
You'll need to use critical thinking to determine how to remove these rows.
The dataset is small enough that you could manually drop the problem rows. But that shortcut wouldn't give you practice dealing with larger datasets, where manual removal isn't practical. So use the built-in pandas methods instead.
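
On a larger dataset you probably can't spot these patterns by eye. One possible way to surface them is to count the NaN values in each row and then tally how many rows share each count. The sketch below assumes a small made-up DataFrame (toy_df is hypothetical) in place of player_df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for player_df.
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, np.nan, 6.0, 8.0],
    "c": [7.0, 8.0, np.nan, np.nan],
})

# Summing the Boolean mask across the columns gives one NaN count per row.
nan_per_row = toy_df.isna().sum(axis="columns")

# Tally how many rows have 0, 1, 2, ... missing values.
rows_by_nan_count = nan_per_row.value_counts().sort_index()

print(nan_per_row)
print(rows_by_nan_count)
```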

The how parameter in dropna() can be set to only 'any' or 'all'. Neither of those settings will get you what you need. Instead, use the thresh parameter.

The thresh parameter refers to threshold. This parameter lets you set the minimum number of non-NaN values a row or column needs to avoid being dropped by dropna(). To remove the problem rows from the DataFrame, set thresh to 12 so that any row that has fewer than 12 non-NaN values is dropped.
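
To see the threshold at work on a smaller scale, here is a minimal sketch that uses a made-up three-column DataFrame (toy_df is hypothetical) and a threshold of 2 instead of 12.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the rows hold 3, 2, 0, and 2 non-NaN values.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, np.nan],
    "b": [4.0, np.nan, np.nan, 8.0],
    "c": [7.0, 8.0, np.nan, 9.0],
})

# Keep only the rows that have at least 2 non-NaN values.
kept = toy_df.dropna(thresh=2)

print(kept)
```

Only the row labeled 2 falls below the threshold. The surviving rows keep their original labels, so the index of the result has a gap.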

In [None]:
# Drop all rows that don't have at least 12 non-NaN values.
player_df.dropna(inplace=True, thresh=12)
player_df.isna().sum()

Because you've dropped rows, the index in the DataFrame is compromised. You see this problem if you print out the first 10 rows of the DataFrame:

In [None]:
# Print the first 10 rows of the player_df DataFrame.
player_df.head(10)

You see that the index counts 0 through 10, skipping 8. The row that had the index of 8 was dropped because it had more than two NaN values. With 14 columns, any row that had three or more NaN values had at most 11 non-NaN values, so it didn't meet the threshold of 12 that you set when you dropped rows earlier.

To fix this problem, reset the index for the DataFrame. This fix saves you from problems down the road when you're working with the DataFrame. While you're at it, take a look at the now smaller DataFrame.

In [None]:
# Renumber the DataFrame index to account for the dropped rows.
player_df.reset_index(drop=True, inplace=True)
player_df.info()
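
The drop=True argument matters here. As a small illustration on made-up data (toy_df is hypothetical), compare the result of reset_index() with and without it:

```python
import pandas as pd

# Hypothetical toy data whose index has gaps, as if rows had been dropped.
toy_df = pd.DataFrame({"points": [10, 20, 30]}, index=[0, 2, 5])

# drop=True discards the old index labels.
with_drop = toy_df.reset_index(drop=True)

# Without drop=True, the old labels are kept as a new column named "index".
without_drop = toy_df.reset_index()

print(with_drop)
print(without_drop)
```

Both results are renumbered from 0, but only the second carries the old labels along as data.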

Check for outliers

Outliers are data values so far outside the distribution of other values that they bring into question whether they even belong in the dataset. Outliers often arise from data errors or other undesirable noise. You'll always need to check for and deal with possible outliers before you analyze the data.

A quick way to look for possible outliers is to use the pandas describe() method:


In [None]:
player_df

In [None]:
player_df.describe()

Here you see, for example, that the mean for all 42 players is 1592.38 points. But look at the numbers for min (183), 25% (1390.75), 50% (1680.0), 75% (1826.25), and max (2062). The min (183) sits far below the 25th percentile (1390.75), so it might be an outlier. You can use box plots to visualize the values and determine possible outliers.

Create box plots for columns

The traditional tool for probing for outlying data values is the box plot. The box in box plot refers to a box drawn around the range of data from the 25th percentile to the 75th percentile. (These percentiles demarcate important quarters of the data. Their range is called the interquartile range.) This box is the middle 50% of the data values for a given variable (a column in a DataFrame). You use another line to mark the median of the data, which is the 50th percentile.

The box plot is also called a box-and-whisker plot because you draw a T shape above and below the box to encompass the maximum and minimum values of the data, excluding outliers. This last part is important for your purposes because it lets you graphically identify outliers.
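
A box plot needs a rule for deciding which values count as outliers. A common convention, and the default in Matplotlib and Seaborn, is to flag values that lie more than 1.5 times the interquartile range beyond the edges of the box. The following sketch applies that rule by hand to a short series of made-up values (the numbers are hypothetical, not the real player data):

```python
import pandas as pd

# Hypothetical point totals; one value is far below the rest.
points = pd.Series([183, 1400, 1550, 1680, 1750, 1830, 2062], name="points")

# The box spans the 25th to the 75th percentile.
q1 = points.quantile(0.25)
q3 = points.quantile(0.75)
iqr = q3 - q1

# Values beyond these fences are drawn as individual outlier points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = points[(points < lower_fence) | (points > upper_fence)]
print(outliers)
```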

Ideally, you would produce the box plots for your columns in a single matrix that you can easily scan. Each call to Seaborn's boxplot() function draws onto a single set of axes, so you'll write a for loop to fill the matrix one cell at a time.

Because of how the Seaborn library in Python works, you need to explicitly state the cell in the matrix where you want to render each box plot. The 13 columns of interest (you don't need to look at ID) go into a grid that is five cells wide. For the column at position i in the list, use the Python floor-division operator (i // 5) to get its row in the grid and the modulo operator (i % 5) to get its column.
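
To check the arithmetic before plotting anything, you can compute the grid positions on their own. This sketch assumes a grid that is five cells wide; the names n_plots and n_grid_cols are chosen here for illustration.

```python
# Number of box plots to place and the width of the subplot grid.
n_plots = 13
n_grid_cols = 5

# Floor division gives the grid row; modulo gives the grid column.
positions = [(i // n_grid_cols, i % n_grid_cols) for i in range(n_plots)]

for i, (row, col) in enumerate(positions):
    print(f"plot {i:2d} -> row {row}, column {col}")
```

The thirteenth plot lands in row 2, column 2, so the last two cells of a 3x5 grid stay empty.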

First, make sure that the Matplotlib and Seaborn libraries are installed. If they aren't, run the following commands (shown here commented out) from a terminal:

#python -m pip install -U pip
#python -m pip install -U matplotlib
#python -m pip install seaborn

In some cases, an installation of seaborn will appear to succeed, but trying to import it will raise an error with the message "No module named seaborn". This usually means that you have multiple Python installations on your system and that your pip or conda points to a different installation than the one where your interpreter lives. Resolving this issue involves sorting out the paths on your system, but you can sometimes avoid it by invoking pip as python -m pip install seaborn.

Then import both libraries into your notebook:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

Now you can:

Create a list of the column names, excluding ID. You'll use the list to select each column in turn.

Create a matrix of subplots so you have one figure that shows all 13 columns.

Add padding around the subplots to make them easier to read. 

Create a box plot based on the data in each column, across all of the rows.

In [None]:
# Create a list of all column names, except for ID.
cols = list(player_df.iloc[:, 1:])

# Create a 3x5 matrix of subplots.
fig, axes = plt.subplots(3, 5, figsize=(18, 11))

# Create padding around subplots to make the axis labels readable.
fig.tight_layout(pad=2.0)

# Loop over the columns of the DataFrame and create a box plot for each one.
# i // 5 picks the grid row and i % 5 picks the grid column.
for i, col in enumerate(cols):
    sns.boxplot(ax=axes[i // 5, i % 5], y=player_df[col])

In [None]:
# Identify the outliers: find the index label of the row that has the
# lowest value in 'points'. The idxmin() method works the same way on any column.

points_outlier = player_df['points'].idxmin()
points_outlier

In [None]:
# Identify the index label of the row that has the lowest value in 'possessions'.
possession_outlier = player_df['possessions'].idxmin()
possession_outlier

In [None]:
# Drop the row that has the outlying values for 'points' and 'possessions'.
# idxmin() returned an index label, so pass that label to drop() directly.
player_df.drop(index=points_outlier, inplace=True)

# Check the end of the DataFrame to ensure that the correct row was dropped.
player_df.tail(10)
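
Note that idxmin() returns an index label, not a position, and drop() also works with labels. The two only coincide when the index runs from 0 without gaps. As a minimal sketch on made-up data (toy_df and its labels are hypothetical):

```python
import pandas as pd

# Hypothetical toy data whose index labels don't start at 0.
toy_df = pd.DataFrame({"points": [1500, 183, 1700]}, index=[10, 11, 12])

# idxmin() returns the label of the row that holds the minimum value.
lowest_label = toy_df["points"].idxmin()

# drop() accepts that label as is; no positional lookup is needed.
trimmed = toy_df.drop(index=lowest_label)

print(lowest_label)
print(trimmed)
```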

In [None]:
player_df.reset_index(drop=True, inplace=True)

In [None]:
player_df.info()

In [None]:
player_df

In [None]:
player_df.isna().sum()