# Data Exploring & Cleaning
##  Gabriel Becton
Today we will focus on taking a first look at some data sets and cleaning the data.


## Cleaning and Munging a Simple Data Frame

Before working with a large data set, let us first practice with a small amount of data in a simple data frame.  This example comes from [here](https://github.com/ajcr/100-pandas-puzzles/blob/master/100-pandas-puzzles-with-solutions.ipynb). The data consists of some made-up flight information.

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
df

Some values in the FlightNumber column are missing. These numbers are meant to increase by 10 with each row, so 10055 and 10075 need to be put in place. Fill in these missing numbers and make the column an integer column (instead of a float column). The pandas `interpolate` function fills in NaNs with interpolated values and is described [here](http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.interpolate.html).

In [None]:
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df['FlightNumber']
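To see why `interpolate` recovers exactly the missing flight numbers, here is a minimal sketch on a standalone Series (the name `flight_numbers` is only illustrative). It assumes the default linear method, which fills each gap with a value evenly spaced between its neighbours.

```python
import numpy as np
import pandas as pd

# Toy series with the same gaps as the FlightNumber column
flight_numbers = pd.Series([10045, np.nan, 10065, np.nan, 10085])

# Default (linear) interpolation fills each NaN from its neighbours
filled = flight_numbers.interpolate()

# The result is still a float column until it is cast explicitly
print(filled.dtype)
print(filled.astype(int).tolist())
```

The cast to `int` is only safe here because no NaNs remain after interpolation.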

The From_To column would be better as two separate columns! Split each string on the underscore delimiter _ to give a new temporary DataFrame with the correct values. Assign the correct column names to this temporary DataFrame.

In [None]:
temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']
temp

Notice how the capitalisation of the city names is all mixed up in this temporary DataFrame. Standardise the strings so that only the first letter is uppercase (e.g. "LoNDon" should become "London".)  The string method `capitalize()` does just that.

In [None]:
temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()
temp

Delete the From_To column from df and attach the temporary DataFrame.

In [None]:
df = df.drop('From_To', axis=1)
df = df.join(temp)
df

| | FlightNumber | RecentDelays | Airline | From | To |
|---|---|---|---|---|---|
| 0 | 10045 | [23, 47] | `KLM(!)` | London | Paris |
| 1 | 10055 | [] | `<Air France> (12)` | Madrid | Milan |
| 2 | 10065 | [24, 43, 87] | `(British Airways. )` | London | Stockholm |
| 3 | 10075 | [13] | `12. Air France` | Budapest | Paris |
| 4 | 10085 | [67, 32] | `"Swiss Air"` | Brussels | London |


In the Airline column, you can see that some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. '(British Airways. )' should become 'British Airways'.

In [None]:
# a raw string keeps the backslash in \s from being treated as a Python escape
df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()
# note: using .strip() gets rid of any leading/trailing spaces
df.Airline
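The pattern works because `str.extract` returns the first run of letters and whitespace in each string. The sketch below applies it to a standalone copy of the airline values (the names `airlines`, `raw`, and `cleaned` are illustrative) and keeps the intermediate result to show why `.strip()` is still needed.

```python
import pandas as pd

airlines = pd.Series(['KLM(!)', '<Air France> (12)', '(British Airways. )',
                      '12. Air France', '"Swiss Air"'])

# First run of letters/whitespace; anything before or after it is ignored
raw = airlines.str.extract(r'([a-zA-Z\s]+)', expand=False)

# '12. Air France' matches starting at the space after the period,
# so the captured text has a leading space that strip() removes
cleaned = raw.str.strip()
print(cleaned.tolist())
```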

In the RecentDelays column, the values have been entered into the DataFrame as a list. We would like each first value in its own column, each second value in its own column, and so on. If there isn't an Nth value, the value should be NaN.

Expand the Series of lists into a DataFrame named delays, rename the columns delay_1, delay_2, etc. and replace the unwanted RecentDelays column in df with delays.

In [None]:
# there are several ways to do this, but the following approach is one of the simplest

delays = df['RecentDelays'].apply(pd.Series)

delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns)+1)]

df = df.drop('RecentDelays', axis=1).join(delays)
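As one of the other possible approaches, the lists can be handed directly to the `DataFrame` constructor, which pads shorter rows with NaN. This is a sketch on a standalone Series (`recent` and `delays_alt` are illustrative names), not a replacement for the cell above.

```python
import pandas as pd

recent = pd.Series([[23, 47], [], [24, 43, 87], [13], [67, 32]])

# Build the frame from the plain Python lists; ragged rows are padded with NaN
delays_alt = pd.DataFrame(recent.tolist(), index=recent.index)
delays_alt.columns = ['delay_{}'.format(n) for n in range(1, delays_alt.shape[1] + 1)]
print(delays_alt)
```

Passing `index=recent.index` keeps the rows aligned with the original DataFrame so that a later `join` matches up correctly.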

The DataFrame should look much better now.

In [None]:
df

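| | FlightNumber | Airline | From | To | delay_1 | delay_2 | delay_3 |
|---|---|---|---|---|---|---|---|
| 0 | 10045 | KLM | London | Paris | 23.0 | 47.0 | NaN |
| 1 | 10055 | Air France | Madrid | Milan | NaN | NaN | NaN |
| 2 | 10065 | British Airways | London | Stockholm | 24.0 | 43.0 | 87.0 |
| 3 | 10075 | Air France | Budapest | Paris | 13.0 | NaN | NaN |
| 4 | 10085 | Swiss Air | Brussels | London | 67.0 | 32.0 | NaN |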


Finally, let's replace the NaNs in the delay columns with zeros.  Take a look at the pandas [fillna](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method for a suggestion on how to do this. Also display the modified DataFrame.

In [None]:
# Task 1: Enter code in this cell to change the NaN delays to zeros.  Re-display the modified DataFrame.
df['delay_1'] = df['delay_1'].fillna(value=0)

In [None]:
df['delay_1']

In [None]:
df['delay_2'] = df['delay_2'].fillna(value=0)

In [None]:
df['delay_2']

In [None]:
df['delay_3'] = df['delay_3'].fillna(value=0)

In [None]:
df['delay_3']
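The three delay columns can also be filled in a single step. The sketch below uses a small stand-in DataFrame (`toy` and `delay_cols` are illustrative names) and assumes the delay columns are the ones whose names start with `delay_`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Airline': ['KLM', 'Air France'],
                    'delay_1': [23.0, np.nan],
                    'delay_2': [47.0, np.nan],
                    'delay_3': [np.nan, np.nan]})

# Select every delay column by name and fill them all at once
delay_cols = [c for c in toy.columns if c.startswith('delay_')]
toy[delay_cols] = toy[delay_cols].fillna(0)
print(toy)
```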

## Building Permit Data 

Next we will practice cleaning a larger dataset from Kaggle.  This dataset is described in detail [here](https://www.kaggle.com/aparnashastry/building-permit-applications-data).  Go to that page to look at the column details as well as some summary statistics and histograms. The analysis we will perform will follow closely the one from [here](https://www.kaggle.com/chrisbow/cleaning-data-with-python-challenge-day-1).

The dataset describes building permits issued for San Francisco from Jan 1, 2013 to Feb 25th 2018.  First, we will download the data.

In [None]:
# Replace 'username' with your username from your kaggle.json file
# Replace 'yourkey' with your key from your kaggle.json file
# This key will be a long string of numbers and letters

import os
os.environ['KAGGLE_USERNAME'] = 'username'
os.environ['KAGGLE_KEY'] = 'yourkey'

In [None]:
! kaggle datasets download -d aparnashastry/building-permit-applications-data

Next, unzip the file and read the contents into a DataFrame.

In [None]:
! unzip building-permit-applications-data.zip

sfPermits = pd.read_csv("Building_Permits.csv")

Next, let us take a first look at the data.  Display some randomly selected rows from our data.  We will first set the random seed so that we get the same rows picked if we re-run the notebook.

In [None]:
np.random.seed(0)
sfPermits.sample(5)

Quite a few missing values are visible already, and we have only looked at five rows of the dataset. Cleaning will be required.

### Find out what percent of the sf_permit dataset is missing

In [None]:
# Calculate total number of cells in dataframe
# (np.prod replaces np.product, which was removed in NumPy 2.0)
totalCells = np.prod(sfPermits.shape)

# Count number of missing values per column
missingCount = sfPermits.isnull().sum()

# Calculate total number of missing values
totalMissing = missingCount.sum()

# Calculate percentage of missing values
print("The SF Permits dataset contains", round(((totalMissing/totalCells) * 100), 2), "%", "missing values.")
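The same calculation on a tiny made-up DataFrame makes the arithmetic easy to check by hand. The names `toy`, `total_cells`, `total_missing`, and `pct_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1, np.nan, 3, 4],
                    'b': [np.nan, np.nan, 'x', 'y']})

total_cells = toy.size                     # rows * columns = 4 * 2
total_missing = toy.isnull().sum().sum()   # one NaN in 'a', two in 'b'
pct_missing = total_missing / total_cells * 100
print(round(pct_missing, 2), "% missing")
```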

Look at the columns Street Number Suffix and Zipcode from the sf_permits dataset. Both contain missing values. Which, if either, are missing because the values don't exist? Which, if either, are missing because they weren't recorded?

In [None]:
missingCount[['Street Number Suffix', 'Zipcode']]

Looks like a lot more missing values for street number suffix than zipcode. Let's check out the percentages:

In [None]:
print("Percent missing data in Street Number Suffix column =", (round(((missingCount['Street Number Suffix'] / sfPermits.shape[0]) * 100), 2)))
print("Percent missing data in Zipcode column =", (round(((missingCount['Zipcode'] / sfPermits.shape[0]) * 100), 2)))

As every address has a zip code, the missing values in the Zipcode column were most likely not recorded. For the Street Number Suffix column, very few properties have a suffix to the number: there are a lot of 3s, 18s, and 46s, but not nearly as many 36A or 18B. These values are therefore most likely missing because they don't exist.

### Try removing all the rows from the sf_permits dataset that contain missing values. How many are left?

In [None]:
sfPermits.dropna()

If we drop all rows that contain a missing value, we greatly simplify our dataset. So simple, we can go for an early lunch. Every row contains at least one missing value (well, we know from our Street Number Suffix answer above that simply eliminating those gets rid of nearly 99% of our data), so we end up with a dataframe of column headers.

### Now try removing all the columns with empty values. Now how much of your data is left?

In [None]:
sfPermitsCleanCols = sfPermits.dropna(axis=1)
sfPermitsCleanCols.head()

In [None]:
print("Columns in original dataset: %d \n" % sfPermits.shape[1])
print("Columns with na's dropped: %d" % sfPermitsCleanCols.shape[1])

Well, that gives us a clean set of values, but we've sacrificed a lot of variables in the process...
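The difference between dropping rows and dropping columns is easy to see on a small stand-in DataFrame. The column names below are made up to echo the permit data; they are not taken from it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'id': [1, 2, 3],
                    'suffix': [np.nan, 'A', np.nan],
                    'zip': [94110.0, np.nan, 94103.0]})

# axis=0 (the default) drops every row that has at least one NaN
rows_kept = toy.dropna()

# axis=1 drops every column that has at least one NaN
cols_kept = toy.dropna(axis=1)

print(rows_kept.shape, cols_kept.shape)
```

Every row here has a gap somewhere, so no rows survive, while the one fully populated column does.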

Try replacing each NaN in the sf_permit data with the value from the row directly before it, and then replace all the remaining NaNs with 0. Since the building permits in each row are likely unrelated to the ones before and after, this is not an awesome technique, but it is a simple way to fill in missing data.

In [None]:
# forward-fill down each column, then set anything still missing to the number 0
imputeSfPermits = sfPermits.ffill(axis=0).fillna(0)

imputeSfPermits.head()
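A short sketch on made-up data shows why the second fill is needed: forward fill cannot do anything for a NaN in the first row, because there is no earlier value to copy. The names `toy` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [np.nan, 1.0, np.nan, 4.0],
                    'y': [2.0, np.nan, np.nan, 5.0]})

# Forward fill copies the last seen value down each column;
# the leading NaN in 'x' has nothing above it, so fillna(0) handles it
filled = toy.ffill().fillna(0)
print(filled)
```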

### Calculate the average GPS coordinates for San Francisco building permits

The Location column holds two values per building permit that look like (37.785719256680785, -122.40852313194863).  The first number is the GPS latitude and the second the GPS longitude.  Replace the Location column with two columns, one named GPS Lat and the other named GPS Lon, that separately hold the latitude and longitude data. There are multiple methods for doing this; one uses `.str.split`, `.str.replace`, and `.astype('float')`.

In [None]:
# Enter code in this cell to replace the Location column of imputeSFPermits with two columns
# that hold the GPS latitude and longitude, respectively.

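# Strip the surrounding parentheses first so each half can later be converted to float
temp1 = sfPermits.Location.str.strip('()').str.split(',', expand=True)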
temp1.columns = ['GPS Lat', 'GPS Lon']
temp1

Finally, calculate the average latitude and average longitude of the San Francisco permit locations.

In [None]:
# Enter code to compute the average lat and lon values.
temp1.isnull().sum()

In [None]:
temp1.dropna()
temp2 = temp1.dropna()

In [None]:
temp2

In [None]:
import statistics
temp3 = temp2.astype(float)

statistics.mean(temp3['GPS Lon'])

In [None]:
statistics.mean(temp3['GPS Lat'])
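The whole parsing chain can be checked on a few made-up location strings. This sketch assumes the values have the form `(lat, lon)`; the names `location`, `coords`, and `means` are illustrative, and pandas' own `.mean()` stands in for `statistics.mean`.

```python
import pandas as pd

location = pd.Series(['(37.5, -122.5)', None, '(38.5, -121.5)'])

# Drop missing entries, strip the parentheses, split on the comma, convert to float
coords = (location.dropna()
                  .str.strip('()')
                  .str.split(',', expand=True)
                  .astype(float))
coords.columns = ['GPS Lat', 'GPS Lon']

# Column-wise mean of latitude and longitude
means = coords.mean()
print(means)
```

The leading space left in front of each longitude is harmless, since the float conversion ignores surrounding whitespace.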

## IMDB Movie Data

Next we will be inspecting and cleaning a dataset from Kaggle that consists of data scraped from the IMDB website.  The Kaggle site for the data is [here](https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset) and we will be following an analysis recommended by [this site](http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/).

First, we will download and unzip the data.

In [None]:
! kaggle datasets download -d carolzhangdc/imdb-5000-movie-dataset

In [None]:
! unzip imdb-5000-movie-dataset.zip

Then read the data into a DataFrame.

In [None]:
imdb_df = pd.read_csv('movie_metadata.csv')
imdb_df.head(10)

We can see that there are some missing values and that some movies have very incomplete information (look at row 4, for example).  We can eliminate these rows from our set. With the `dropna` method used below, axis = 0 means go by row (axis = 1 means go by column) and thresh = 20 means drop any rows with fewer than 20 non-NaN values.

In [None]:
numrows_before, _ = imdb_df.shape
imdb_df = imdb_df.dropna(axis=0, thresh=20) 
numrows_after, _ = imdb_df.shape
numrows_removed = numrows_before - numrows_after

print('{} rows removed'.format(numrows_removed))

In [None]:
imdb_df.head(10)

Of course, we could also drop rows with any missing information via `.dropna()`, but we would likely lose a lot of our dataset.  If we want to drop only the rows in which every value is missing, we can use `.dropna(how='all')`, but it is unlikely that we would have rows with absolutely no information.
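The three variants of `dropna` discussed here can be compared side by side on a small made-up DataFrame (the name `toy` and its values are illustrative). Its four rows hold 4, 3, 1, and 0 non-NaN values respectively.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1, np.nan, np.nan, np.nan],
                    'b': [2, 2, np.nan, np.nan],
                    'c': [3, 3, np.nan, np.nan],
                    'd': [4, 4, 4, np.nan]})

kept_thresh = toy.dropna(thresh=3)   # keep rows with at least 3 non-NaN values
kept_any = toy.dropna()              # keep only rows with no NaN at all
kept_all = toy.dropna(how='all')     # drop only rows that are entirely NaN

print(len(kept_thresh), len(kept_any), len(kept_all))
```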

Let us next investigate which are the most problematic columns.

In [None]:
missingCount = imdb_df.isnull().sum()
missingCount

Some columns contain strings.  It would be appropriate to replace missing string values with an empty string.

In [None]:
imdb_df['color'] = imdb_df['color'].fillna('')
missingCount = imdb_df.isnull().sum()
missingCount

In [None]:
# Enter code to replace the rest of the missing string values with empty strings
imdb_df['director_name'] = imdb_df['director_name'].fillna('')

In [None]:
imdb_df['actor_2_name'] = imdb_df['actor_2_name'].fillna('')

In [None]:
imdb_df['actor_1_name'] = imdb_df['actor_1_name'].fillna('')

In [None]:
imdb_df['actor_3_name'] = imdb_df['actor_3_name'].fillna('')

In [None]:
imdb_df['language'] = imdb_df['language'].fillna('')

In [None]:
imdb_df['country'] = imdb_df['country'].fillna('')

In [None]:
imdb_df['plot_keywords'] = imdb_df['plot_keywords'].fillna('')

In [None]:
imdb_df['content_rating'] = imdb_df['content_rating'].fillna('')
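Instead of one cell per column, the string columns can be filled together. The sketch below uses a made-up stand-in for the movie data and assumes that every non-numeric column holds strings, which should be checked before applying it to a real dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'color': ['Color', None],
                    'language': [None, 'English'],
                    'duration': [120.0, np.nan]})

# Treat every non-numeric column as a string column (an assumption)
string_cols = toy.select_dtypes(exclude='number').columns
toy[string_cols] = toy[string_cols].fillna('')

print(toy.isnull().sum())
```

The numeric `duration` column is left untouched, so its missing value can be handled separately.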

Let's say that the year that the movie came out is very important to us and we just don't care about movies where that information is not available.

In [None]:
imdb_df = imdb_df.dropna(subset=['title_year'])

`title_year` may be a confusing column header, so we can rename it.

In [None]:
imdb_df = imdb_df.rename(columns={'title_year': 'release_date'})
imdb_df.columns

Let's make a histogram of release dates.

In [None]:
%matplotlib inline

imdb_df.hist(column='release_date', bins=20)

In [None]:
# Calculate and output the min, max, mean, and standard deviation of the imdb_score.
# Also plot a histogram of the values stored in this column.
import statistics
import numpy as np

imdb_df_mean = statistics.mean(imdb_df['imdb_score'])
imdb_df_min = min(imdb_df['imdb_score'])
imdb_df_max = max(imdb_df['imdb_score'])
imdb_df_sd = statistics.stdev(imdb_df['imdb_score'])

imdb_df.hist(column='imdb_score')

print('the mean of imdb_score is ', imdb_df_mean)
print('the max value of imdb_score is ',imdb_df_max)
print('the min value of imdb_score is ',imdb_df_min)
print('the standard deviation of imdb_score is ',imdb_df_sd)
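pandas can produce the same four statistics in one call. This sketch uses a short made-up Series of scores (`scores` and `summary` are illustrative names) and checks that pandas' `std` agrees with `statistics.stdev`, since both compute the sample standard deviation.

```python
import statistics

import pandas as pd

scores = pd.Series([6.0, 7.0, 8.0, 9.0])

# One call returns all four summary statistics as a labelled Series
summary = scores.agg(['min', 'max', 'mean', 'std'])
print(summary)

# pandas uses ddof=1 by default, matching statistics.stdev
print(abs(summary['std'] - statistics.stdev(scores)) < 1e-12)
```

For a quick overview that also includes quartiles, `scores.describe()` is another option.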


Looking back at what data is missing, we can see that not every movie has a duration available.  If we want to fill in these missing duration values, a crude way of doing this would be to replace the missing values with the mean movie duration.  Since most movies have durations that fall within a range of times, this would be better than simply setting the missing values to zero.

In [None]:
imdb_df.duration = imdb_df.duration.fillna(imdb_df.duration.mean())

In [None]:
missingCount = imdb_df.isnull().sum()
missingCount

In [None]:
# Replace the missing values in the remaining columns. 
# Re-calculate missingCount to show that you have successfully replaced them.


In [None]:
imdb_df.num_critic_for_reviews = imdb_df.num_critic_for_reviews.fillna(imdb_df.num_critic_for_reviews.mean())

In [None]:
imdb_df.actor_3_facebook_likes = imdb_df.actor_3_facebook_likes.fillna(imdb_df.actor_3_facebook_likes.mean())

In [None]:
imdb_df.actor_1_facebook_likes = imdb_df.actor_1_facebook_likes.fillna(imdb_df.actor_1_facebook_likes.mean())

In [None]:
imdb_df.actor_2_facebook_likes = imdb_df.actor_2_facebook_likes.fillna(imdb_df.actor_2_facebook_likes.mean())

In [None]:
imdb_df.gross = imdb_df.gross.fillna(imdb_df.gross.mean())

In [None]:
imdb_df.facenumber_in_poster = imdb_df.facenumber_in_poster.fillna(imdb_df.facenumber_in_poster.mean())

In [None]:
imdb_df.num_user_for_reviews = imdb_df.num_user_for_reviews.fillna(imdb_df.num_user_for_reviews.mean())

In [None]:
imdb_df.budget = imdb_df.budget.fillna(imdb_df.budget.mean())

In [None]:
imdb_df.aspect_ratio = imdb_df.aspect_ratio.fillna(imdb_df.aspect_ratio.mean())

In [None]:
missingCount = imdb_df.isnull().sum()
missingCount
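The per-column mean fills above can also be written as a single step over all numeric columns. This is a sketch on made-up data (`toy` and `numeric_cols` are illustrative names); whether the mean is a sensible fill value for every numeric column is a separate judgment call.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'gross': [100.0, np.nan, 300.0],
                    'budget': [np.nan, 50.0, 150.0],
                    'country': ['USA', 'UK', 'USA']})

# fillna accepts a Series of per-column fill values, matched by column name
numeric_cols = toy.select_dtypes(include='number').columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(toy.isnull().sum())
```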