# Cleaning: Cycle Share

Three datasets provide data on stations, trips, and weather from 2014 to 2016.

**Station dataset**

* station_id: station ID number
* name: name of station
* lat: station latitude
* long: station longitude
* install_date: date that station was placed in service
* install_dockcount: number of docks at each station on the installation date
* modification_date: date that station was modified, resulting in a change in location or dock count
* current_dockcount: number of docks at each station on 8/31/2016
* decommission_date: date that station was placed out of service

**Trip dataset**

* trip_id: numeric ID of bike trip taken
* starttime: day and time trip started, in PST
* stoptime: day and time trip ended, in PST
* bikeid: ID attached to each bike
* tripduration: time of trip in seconds
* from_station_name: name of station where trip originated
* to_station_name: name of station where trip terminated
* from_station_id: ID of station where trip originated
* to_station_id: ID of station where trip terminated
* usertype: "Short-Term Pass Holder" is a rider who purchased a 24-Hour or 3-Day Pass; "Member" is a rider who purchased a Monthly or an Annual Membership
* gender: gender of rider
* birthyear: birth year of rider

**Weather dataset** contains daily weather information in the service area

## 1. Import all sets into a dictionary and correct any errors

In the trip file, the header row was repeated after the values on one line; I removed it by deleting those values from the file. I also noticed that the first several rows were repeated and that the line with the repeated headers was one of them, so I used the values from the original line to fill in the missing ones.

In [1]:
import pandas as pd
import numpy as np

sets = ['station', 'trip', 'weather']

cycle = {}

for s in sets:
    cycle[s] = pd.read_csv('cycle_share/' + s + '.csv')

In [2]:
cycle['trip'].head()

Unnamed: 0,trip_id,starttime,stoptime,bikeid,tripduration,from_station_name,to_station_name,from_station_id,to_station_id,usertype,gender,birthyear
0,431,10/13/2014 10:31,10/13/2014 10:48,SEA00298,985.935,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,Male,1960.0
1,432,10/13/2014 10:32,10/13/2014 10:48,SEA00195,926.375,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,Male,1970.0
2,433,10/13/2014 10:33,10/13/2014 10:48,SEA00486,883.831,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,Female,1988.0
3,434,10/13/2014 10:34,10/13/2014 10:48,SEA00333,865.937,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,Female,1977.0
4,435,10/13/2014 10:34,10/13/2014 10:49,SEA00202,923.923,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,Male,1971.0


## 2. Print data summaries including the number of null values. Should we drop or try to correct any of the null values?

In [3]:
for df in cycle:
    print(df)
    print(cycle[df].describe(include='all'))
    print('\n')

station
       station_id                   name        lat        long install_date  \
count          58                     58  58.000000   58.000000           58   
unique         58                     58        NaN         NaN            9   
top         PS-04  9th Ave N & Mercer St        NaN         NaN   10/13/2014   
freq            1                      1        NaN         NaN           50   
mean          NaN                    NaN  47.624796 -122.327242          NaN   
std           NaN                    NaN   0.019066    0.014957          NaN   
min           NaN                    NaN  47.598488 -122.355230          NaN   
25%           NaN                    NaN  47.613239 -122.338735          NaN   
50%           NaN                    NaN  47.618591 -122.328207          NaN   
75%           NaN                    NaN  47.627712 -122.316691          NaN   
max           NaN                    NaN  47.666145 -122.284119          NaN   

        install_dockcount modif

Gender and birth year have nulls. I don't think we should drop them, because we would lose over 100,000 rows; instead, we could replace the nulls in birth year with the median or mean. Gender cannot be sensibly imputed, but it should be noted that most of the entries are male.

In [5]:
cycle['trip'].groupby('gender')['trip_id'].count()

gender
Female     37562
Male      140564
Other       3431
Name: trip_id, dtype: int64
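The null handling discussed above can be sketched on a toy frame. This is only an illustration: `toy_trips` and its values are invented stand-ins for `cycle['trip']`, and the `birthyear_filled` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cycle['trip']; the values are invented for illustration.
toy_trips = pd.DataFrame({
    'trip_id': [1, 2, 3, 4],
    'gender': ['Male', None, 'Female', None],
    'birthyear': [1960.0, np.nan, 1988.0, 1970.0],
})

# Number of nulls per column.
null_counts = toy_trips.isnull().sum()
print(null_counts)

# Fill missing birth years with the median of the known ones,
# keeping the original column so the imputation stays visible.
median_year = toy_trips['birthyear'].median()
toy_trips['birthyear_filled'] = toy_trips['birthyear'].fillna(median_year)
print(toy_trips)
```

Writing the filled values to a separate column is a design choice here; it makes it easy to tell imputed years from recorded ones later.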

## 3. Create a column in the trip table that contains only the date (no time)

In [39]:
#cycle['trip']['date'] = cycle['trip']['starttime'].apply(lambda x: pd.to_datetime(x[0:x.find(' ')], format='%m/%d/%Y'))
cycle['trip']['date'] = cycle['trip']['starttime'].apply(lambda x: x[0:x.find(' ')])
cycle['trip'].head()

Unnamed: 0,trip_id,starttime,stoptime,bikeid,tripduration,from_station_name,to_station_name,from_station_id,to_station_id,usertype,gender,birthyear,date
0,431,10/13/2014 10:31,10/13/2014 10:48,SEA00298,985.935,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,Male,1960.0,10/13/2014
1,432,10/13/2014 10:32,10/13/2014 10:48,SEA00195,926.375,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,Male,1970.0,10/13/2014
2,433,10/13/2014 10:33,10/13/2014 10:48,SEA00486,883.831,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,Female,1988.0,10/13/2014
3,434,10/13/2014 10:34,10/13/2014 10:48,SEA00333,865.937,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,Female,1977.0,10/13/2014
4,435,10/13/2014 10:34,10/13/2014 10:49,SEA00202,923.923,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,Male,1971.0,10/13/2014


## 4. Merge weather data with trip data and be sure not to lose any trip data

In [40]:
trip_weather = pd.merge(cycle['trip'], cycle['weather'], left_on='date', right_on='Date', how='left')
trip_weather.head()

Unnamed: 0,trip_id,starttime,stoptime,bikeid,tripduration,from_station_name,to_station_name,from_station_id,to_station_id,usertype,...,Mean_Sea_Level_Pressure_In,Min_Sea_Level_Pressure_In,Max_Visibility_Miles,Mean_Visibility_Miles,Min_Visibility_Miles,Max_Wind_Speed_MPH,Mean_Wind_Speed_MPH,Max_Gust_Speed_MPH,Precipitation_In,Events
0,431,10/13/2014 10:31,10/13/2014 10:48,SEA00298,985.935,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,...,29.79,29.65,10,10,4,13,4,21,0.0,Rain
1,432,10/13/2014 10:32,10/13/2014 10:48,SEA00195,926.375,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,...,29.79,29.65,10,10,4,13,4,21,0.0,Rain
2,433,10/13/2014 10:33,10/13/2014 10:48,SEA00486,883.831,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,...,29.79,29.65,10,10,4,13,4,21,0.0,Rain
3,434,10/13/2014 10:34,10/13/2014 10:48,SEA00333,865.937,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,...,29.79,29.65,10,10,4,13,4,21,0.0,Rain
4,435,10/13/2014 10:34,10/13/2014 10:49,SEA00202,923.923,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,...,29.79,29.65,10,10,4,13,4,21,0.0,Rain
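One way to check that a left merge neither drops nor multiplies trips is sketched below. The toy frames are invented; `validate` and `indicator` are standard `pd.merge` parameters, but whether the real weather file has exactly one row per date is an assumption.

```python
import pandas as pd

# Invented stand-ins for cycle['trip'] and cycle['weather'].
toy_trips = pd.DataFrame({
    'trip_id': [1, 2, 3],
    'date': ['10/13/2014', '10/13/2014', '10/14/2014'],
})
toy_weather = pd.DataFrame({'Date': ['10/13/2014'], 'Events': ['Rain']})

# validate='many_to_one' raises if a weather date appears more than once,
# which would otherwise duplicate trips in the result.
merged = pd.merge(toy_trips, toy_weather, left_on='date', right_on='Date',
                  how='left', validate='many_to_one', indicator=True)

# A left join keeps every trip, so the row count should be unchanged.
assert len(merged) == len(toy_trips)

# Trips whose date had no weather record.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['trip_id', 'date']])
```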


## 5. Drop records that are completely duplicated (all values). Check for and inspect any duplicate trip_id values that remain. Remove if they exist.

In [68]:
print(len(trip_weather))
trip_weather = trip_weather.drop_duplicates()
print(len(trip_weather))

286858
236065


In [72]:
print(len(trip_weather['trip_id']))
print(len(trip_weather['trip_id'].unique()))

236065
236065
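Had the two counts differed, the repeated IDs could be inspected before removal. A minimal sketch on invented data, assuming that keeping the first occurrence is acceptable:

```python
import pandas as pd

# Invented rows: trip_id 432 appears twice with different durations.
toy = pd.DataFrame({
    'trip_id': [431, 432, 432, 433],
    'tripduration': [985.9, 926.4, 930.0, 883.8],
})

# keep=False marks every row of a repeated trip_id, so all copies can be compared.
dup_rows = toy[toy['trip_id'].duplicated(keep=False)]
print(dup_rows)

# Drop the repeats, keeping the first occurrence of each trip_id.
deduped = toy.drop_duplicates(subset='trip_id', keep='first')
```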


## 6. Create columns for lat & long values for the from- and to- stations

In [80]:
trip_weather = pd.merge(trip_weather, cycle['station'][['station_id', 'lat', 'long']], left_on='from_station_id', right_on='station_id', how='left').drop('station_id', axis=1)
trip_weather = pd.merge(trip_weather, cycle['station'][['station_id', 'lat', 'long']], left_on='to_station_id', right_on='station_id', how='left', suffixes=['_from_station', '_to_station']).drop('station_id', axis=1)
trip_weather.head()

Unnamed: 0,trip_id,starttime,stoptime,bikeid,tripduration,from_station_name,to_station_name,from_station_id,to_station_id,usertype,...,Min_Visibility_Miles,Max_Wind_Speed_MPH,Mean_Wind_Speed_MPH,Max_Gust_Speed_MPH,Precipitation_In,Events,lat_from_station,long_from_station,lat_to_station,long_to_station
0,431,10/13/2014 10:31,10/13/2014 10:48,SEA00298,985.935,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,...,4,13,4,21,0.0,Rain,47.60595,-122.335768,47.600757,-122.332946
1,432,10/13/2014 10:32,10/13/2014 10:48,SEA00195,926.375,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,...,4,13,4,21,0.0,Rain,47.60595,-122.335768,47.600757,-122.332946
2,433,10/13/2014 10:33,10/13/2014 10:48,SEA00486,883.831,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,...,4,13,4,21,0.0,Rain,47.60595,-122.335768,47.600757,-122.332946
3,434,10/13/2014 10:34,10/13/2014 10:48,SEA00333,865.937,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,...,4,13,4,21,0.0,Rain,47.60595,-122.335768,47.600757,-122.332946
4,435,10/13/2014 10:34,10/13/2014 10:49,SEA00202,923.923,2nd Ave & Spring St,Occidental Park / Occidental Ave S & S Washing...,CBD-06,PS-04,Member,...,4,13,4,21,0.0,Rain,47.60595,-122.335768,47.600757,-122.332946


## 7. Write a function to round all `tripduration` values to the nearest half second increment and then round all the values in the data

In [129]:
def round_trips(duration):
    roundings = np.array([np.floor(duration), np.floor(duration)+0.5, np.ceil(duration)])
    return roundings[np.argmin(np.abs(duration - roundings))]

trip_weather['tripduration'] = trip_weather['tripduration'].apply(round_trips)

In [130]:
trip_weather['tripduration'].head(10)

0    986.0
1    926.5
2    884.0
3    866.0
4    924.0
5    809.0
6    596.5
7    592.0
8    586.5
9    587.5
Name: tripduration, dtype: float64
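A vectorized alternative to `round_trips` is to double the values, round to the nearest integer, and halve again. This is a sketch, not a drop-in replacement: `Series.round` rounds exact ties to the nearest even integer, so values that end in exactly .25 or .75 may be rounded differently than by the function above.

```python
import pandas as pd

# The first three tripduration values from the table above.
durations = pd.Series([985.935, 926.375, 883.831])

# Doubling maps half-second steps onto whole numbers.
rounded = (durations * 2).round() / 2
print(rounded)
```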

## 8. Verify that `tripduration` matches the timestamps to within 60 seconds

In [140]:
trip_weather[np.abs(((pd.to_datetime(trip_weather['stoptime']) - pd.to_datetime(trip_weather['starttime'])) / np.timedelta64(1, 's')) - trip_weather['tripduration']) > 60]

Unnamed: 0,trip_id,starttime,stoptime,bikeid,tripduration,from_station_name,to_station_name,from_station_id,to_station_id,usertype,...,Min_Visibility_Miles,Max_Wind_Speed_MPH,Mean_Wind_Speed_MPH,Max_Gust_Speed_MPH,Precipitation_In,Events,lat_from_station,long_from_station,lat_to_station,long_to_station
7040,8660,11/2/2014 1:29,11/2/2014 1:12,SEA00384,2571.0,Pine St & 9th Ave,Westlake Ave & 6th Ave,SLU-16,SLU-15,Short-Term Pass Holder,...,4,9,5,-,0.11,Rain,47.613715,-122.331777,47.613628,-122.337341
7041,8661,11/2/2014 1:29,11/2/2014 1:35,SEA00371,3986.0,Cal Anderson Park / 11th Ave & Pine St,Cal Anderson Park / 11th Ave & Pine St,CH-08,CH-08,Short-Term Pass Holder,...,4,9,5,-,0.11,Rain,47.615486,-122.318245,47.615486,-122.318245
7042,8662,11/2/2014 1:29,11/2/2014 1:35,SEA00170,3978.5,Cal Anderson Park / 11th Ave & Pine St,Cal Anderson Park / 11th Ave & Pine St,CH-08,CH-08,Short-Term Pass Holder,...,4,9,5,-,0.11,Rain,47.615486,-122.318245,47.615486,-122.318245
7043,8663,11/2/2014 1:29,11/2/2014 1:11,SEA00205,2513.5,Pine St & 9th Ave,Westlake Ave & 6th Ave,SLU-16,SLU-15,Short-Term Pass Holder,...,4,9,5,-,0.11,Rain,47.613715,-122.331777,47.613628,-122.337341
7044,8666,11/2/2014 1:31,11/2/2014 1:11,SEA00430,2398.5,Pine St & 9th Ave,Westlake Ave & 6th Ave,SLU-16,SLU-15,Short-Term Pass Holder,...,4,9,5,-,0.11,Rain,47.613715,-122.331777,47.613628,-122.337341
7045,8667,11/2/2014 1:37,11/2/2014 1:12,SEA00112,2074.0,Pine St & 9th Ave,Westlake Ave & 6th Ave,SLU-16,SLU-15,Short-Term Pass Holder,...,4,9,5,-,0.11,Rain,47.613715,-122.331777,47.613628,-122.337341
7046,8669,11/2/2014 1:44,11/2/2014 1:04,SEA00247,1201.5,2nd Ave & Vine St,Key Arena / 1st Ave N & Harrison St,BT-03,SLU-19,Member,...,4,9,5,-,0.11,Rain,47.615829,-122.348564,47.622277,-122.35523
7047,8670,11/2/2014 1:52,11/2/2014 1:07,SEA00460,919.0,2nd Ave & Vine St,Summit Ave & E Denny Way,BT-03,CH-01,Member,...,4,9,5,-,0.11,Rain,47.615829,-122.348564,47.618633,-122.325249
7048,8672,11/2/2014 1:59,11/2/2014 1:25,SEA00481,1565.0,UW Magnuson Health Sciences Center Rotunda / C...,Children's Hospital / Sandpoint Way NE & 40th ...,UW-10,DPD-03,Short-Term Pass Holder,...,4,9,5,-,0.11,Rain,47.650725,-122.311188,47.663509,-122.284119
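The same check can be broken into named steps, which makes the size of each discrepancy visible. The sketch below uses one ordinary row and one flagged row from the outputs above; the `discrepancy` column name is an assumption. All flagged rows fall between 1:00 and 2:00 on 11/2/2014, which is consistent with the clock change at the end of daylight saving time, but that reading is an interpretation rather than something the data states.

```python
import pandas as pd

# One ordinary trip and one flagged trip, copied from the outputs above.
toy = pd.DataFrame({
    'starttime': ['10/13/2014 10:31', '11/2/2014 1:29'],
    'stoptime': ['10/13/2014 10:48', '11/2/2014 1:12'],
    'tripduration': [986.0, 2571.0],
})

fmt = '%m/%d/%Y %H:%M'
start = pd.to_datetime(toy['starttime'], format=fmt)
stop = pd.to_datetime(toy['stoptime'], format=fmt)

# Elapsed time according to the timestamps, in seconds.
elapsed = (stop - start).dt.total_seconds()

# Absolute gap between the timestamps and the recorded duration.
toy['discrepancy'] = (elapsed - toy['tripduration']).abs()
flagged = toy[toy['discrepancy'] > 60]
print(flagged)
```

The timestamps only have minute resolution, so gaps of up to a minute are expected even for clean rows.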


## 9. Something is wrong with the `Max_Gust_Speed_MPH` column. Identify and correct the problem, then save the data.

In [154]:
# the '-' placeholders make this an object column; replace them with NaN and convert to float
trip_weather['Max_Gust_Speed_MPH'] = trip_weather['Max_Gust_Speed_MPH'].replace('-', np.nan).astype('float')

In [155]:
trip_weather['Max_Gust_Speed_MPH'].describe()

count    88509.000000
mean        21.287474
std          5.169371
min         16.000000
25%         18.000000
50%         20.000000
75%         24.000000
max         52.000000
Name: Max_Gust_Speed_MPH, dtype: float64
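An alternative conversion is `pd.to_numeric` with `errors='coerce'`, sketched here on invented values. It turns every unparseable entry into NaN, not only `'-'`, so it could hide other bad values that the explicit `replace` would have surfaced as an error.

```python
import pandas as pd

# Invented values mimicking the raw column, with '-' as the missing marker.
raw = pd.Series(['21', '-', '16', '52'])

# Anything that cannot be parsed as a number becomes NaN.
gust = pd.to_numeric(raw, errors='coerce')
print(gust)
```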

In [156]:
trip_weather.to_csv('cycle_share/trip_weather.csv')

# Cleaning: Movies

This data set contains 28 attributes related to various movie titles that have been scraped from IMDb. The set is supposed to contain unique titles for each record, where each record has the following attributes:

"movie_title" "color" "num_critic_for_reviews" "movie_facebook_likes" "duration" "director_name" "director_facebook_likes" "actor_3_name" "actor_3_facebook_likes" "actor_2_name" "actor_2_facebook_likes" "actor_1_name" "actor_1_facebook_likes" "gross" "genres" "num_voted_users" "cast_total_facebook_likes" "facenumber_in_poster" "plot_keywords" "movie_imdb_link" "num_user_for_reviews" "language" "country" "content_rating" "budget" "title_year" "imdb_score" "aspect_ratio"

The original set is available on Kaggle ([here](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset)).

## 1. Check for and correct similar values in `color`, `language`, and `country`

In [157]:
movies = pd.read_csv('movies/movies_data.csv')
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [224]:
movies.dtypes

color                         object
director_name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
movie_title                   object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
m

In [159]:
movies.describe(include='all')

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,5024,4939,4993.0,5028.0,4939.0,5020.0,5030,5036.0,4159.0,5043,...,5022.0,5031,5038,4740,4551.0,4935.0,5030.0,5043.0,4714.0,5043.0
unique,4,2411,,,,,3040,,,914,...,,47,65,18,,,,,,
top,Color,Steven Spielberg,,,,,Morgan Freeman,,,Drama,...,,English,USA,R,,,,,,
freq,4799,25,,,,,20,,,236,...,,4704,3807,2118,,,,,,
mean,,,140.194272,107.201074,686.509212,645.009761,,6560.047061,48468410.0,,...,272.770808,,,,39752620.0,2002.470517,1651.754473,6.442138,2.220403,7525.964505
std,,,121.601675,25.197441,2813.328607,1665.041728,,15020.75912,68452990.0,,...,377.982886,,,,206114900.0,12.474599,4042.438863,1.125116,1.385113,19320.44511
min,,,1.0,7.0,0.0,0.0,,0.0,162.0,,...,1.0,,,,218.0,1916.0,0.0,1.6,1.18,0.0
25%,,,50.0,93.0,7.0,133.0,,614.0,5340988.0,,...,65.0,,,,6000000.0,1999.0,281.0,5.8,1.85,0.0
50%,,,110.0,103.0,49.0,371.5,,988.0,25517500.0,,...,156.0,,,,20000000.0,2005.0,595.0,6.6,2.35,166.0
75%,,,195.0,118.0,194.5,636.0,,11000.0,62309440.0,,...,326.0,,,,45000000.0,2011.0,918.0,7.2,2.35,3000.0


In [180]:
print(movies['color'].unique())
movies['color'] = movies['color'].apply(lambda x: 'Color' if x == 'color' else 'Black and White' if x == 'black and white' else x)
print(movies['color'].unique())

['Color' nan 'Black and White']
['Color' nan 'Black and White']
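The same kind of fix can be written as a lookup on a normalized key, which extends to `language` and `country` by adding entries to the mapping. This is a sketch on invented values; the `canonical` mapping and the `color_clean` column are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Invented variants of the kind the task describes.
toy = pd.DataFrame({'color': ['Color', 'color', ' Black and White', np.nan]})

# Inspect the raw variants first, including missing values.
print(toy['color'].value_counts(dropna=False))

# Map a trimmed, lower-cased key to the canonical spelling.
canonical = {'color': 'Color', 'black and white': 'Black and White'}
key = toy['color'].str.strip().str.lower()

# Values not in the mapping (and NaN) fall back to the original entry.
toy['color_clean'] = key.map(canonical).fillna(toy['color'])
print(toy)
```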


## 2. Create a function that detects and lists non-numeric columns containing values with leading or trailing whitespace. Remove the whitespace in these columns.

In [210]:
def find_spaces(df):
    cols = []
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for index, value in df.dtypes[df.dtypes == 'object'].items():
        if df[index].str.startswith(' ').any() or df[index].str.endswith(' ').any():
            cols.append(index)
    
    return cols

find_spaces(movies)

['director_name', 'actor_2_name', 'movie_title']

In [217]:
for col in find_spaces(movies):
    movies[col] = movies[col].str.strip()

find_spaces(movies)

[]

## 3. Remove duplicate records. Inspect any remaining duplicate movie titles.

In [221]:
print(len(movies))
movies = movies.drop_duplicates()
print(len(movies))

4998
4998


In [252]:
title_duplicates = list(movies['movie_title'].value_counts()[movies['movie_title'].value_counts() > 1].index)

movies[movies['movie_title'].isin(title_duplicates)].sort_values(by='movie_title')

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
4894,Color,Richard Fleischer,69.0,127.0,130.0,51.0,Robert J. Wilke,618.0,,Adventure|Drama|Family|Fantasy|Sci-Fi,...,108.0,English,USA,Approved,5000000.0,1954.0,53.0,7.2,1.37,0
3711,Color,Richard Fleischer,69.0,127.0,130.0,51.0,Robert J. Wilke,617.0,,Adventure|Drama|Family|Fantasy|Sci-Fi,...,108.0,English,USA,Approved,5000000.0,1954.0,53.0,7.2,1.37,0
1420,Color,Wes Craven,256.0,101.0,0.0,574.0,Lin Shaye,40000.0,26505000.0,Horror,...,668.0,English,USA,X,1800000.0,1984.0,852.0,7.5,1.85,10000
4352,Color,Wes Craven,256.0,101.0,0.0,574.0,Lin Shaye,40000.0,26505000.0,Horror,...,668.0,English,USA,X,1800000.0,1984.0,852.0,7.5,1.85,10000
1113,Color,Julie Taymor,156.0,133.0,278.0,107.0,T.V. Carpio,5000.0,24343673.0,Drama|Fantasy|Musical|Romance,...,524.0,English,USA,PG-13,45000000.0,2007.0,117.0,7.4,2.35,14000
4842,Color,Julie Taymor,156.0,133.0,278.0,107.0,T.V. Carpio,5000.0,24343673.0,Drama|Fantasy|Musical|Romance,...,524.0,English,USA,PG-13,45000000.0,2007.0,117.0,7.4,2.35,14000
4128,Color,Tim Burton,451.0,108.0,13000.0,11000.0,Alan Rickman,40000.0,334185206.0,Adventure|Family|Fantasy,...,736.0,English,USA,PG,200000000.0,2010.0,25000.0,6.5,1.85,24000
33,Color,Tim Burton,451.0,108.0,13000.0,11000.0,Alan Rickman,40000.0,334185206.0,Adventure|Family|Fantasy,...,736.0,English,USA,PG,200000000.0,2010.0,25000.0,6.5,1.85,24000
1389,Color,Cameron Crowe,138.0,105.0,488.0,13000.0,Bradley Cooper,15000.0,20991497.0,Comedy|Drama|Romance,...,172.0,English,USA,PG-13,37000000.0,2015.0,14000.0,5.4,1.85,11000
2639,Color,Cameron Crowe,138.0,105.0,488.0,13000.0,Bradley Cooper,15000.0,20991497.0,Comedy|Drama|Romance,...,172.0,English,USA,PG-13,37000000.0,2015.0,14000.0,5.4,1.85,11000


In [256]:
print(movies.loc[337])
print(movies.loc[4584])

color                                                                    Color
director_name                                                    Peter Jackson
num_critic_for_reviews                                                     308
duration                                                                   135
director_facebook_likes                                                      0
actor_3_facebook_likes                                                     310
actor_2_name                                                       AJ Michalka
actor_1_facebook_likes                                                     873
gross                                                              4.39828e+07
genres                                                  Drama|Fantasy|Thriller
actor_1_name                                                 Michael Imperioli
movie_title                                                   The Lovely Bones
num_voted_users                                     

## 4. Create a function that returns two arrays: one for titles that are truly duplicated, and one for duplicated titles that are not the same movie.
* hint: do this by comparing the imdb link values

In [269]:
true_dup = []
false_dup = []

for title in title_duplicates:
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for index, value in movies[movies['movie_title'] == title]['movie_imdb_link'].value_counts().items():
        if value > 1:
            true_dup.append(title)
        else:
            false_dup.append(title)
            break

print(true_dup)
print(false_dup)

['Home', 'Ben-Hur', 'King Kong', 'The Watch', 'Clash of the Titans', 'A Nightmare on Elm Street', '20,000 Leagues Under the Sea', 'Poltergeist', 'The Gambler', 'Planet of the Apes', 'Ghostbusters', 'Spider-Man 3', 'Jack Reacher', 'The Omen', 'Day of the Dead', 'Dredd', 'Murder by Numbers', 'Juno', 'Precious', 'Halloween', 'Skyfall', 'The Fog', 'The Gift', 'The Fast and the Furious', 'Cinderella', 'Glory', 'Creepshow', 'Lolita', "The Astronaut's Wife", 'Casino Royale', 'Mercury Rising', 'The Great Gatsby', 'Syriana', 'Disturbia', 'Unknown', 'Brothers', 'The Jungle Book', 'The Tourist', 'Point Break', 'The Island', 'Pan', 'Aloha', 'The Lovely Bones', 'RoboCop', 'First Blood', 'Twilight', 'Conan the Barbarian', 'The Karate Kid', 'Around the World in 80 Days', 'Goosebumps', 'The Return of the Living Dead', 'Snitch', 'Dodgeball: A True Underdog Story', 'Exodus: Gods and Kings', 'The Day the Earth Stood Still', 'Oz the Great and Powerful', 'The Last House on the Left', 'Dawn of the Dead', 'S
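Since the task asks for a function, the loop above could be wrapped as in the sketch below. The name `split_duplicate_titles`, the toy rows, and the placeholder link values are hypothetical. The classification also differs slightly from the loop: a repeated title goes into exactly one list, depending on whether its rows share a single IMDb link, and rows with a missing link are ignored.

```python
import pandas as pd

def split_duplicate_titles(df):
    """Return (true_dup, false_dup) lists for titles that occur more than once."""
    # Per title: number of rows with a link, and number of distinct links.
    counts = df.groupby('movie_title')['movie_imdb_link'].agg(['count', 'nunique'])
    repeated = counts[counts['count'] > 1]
    # One distinct link: the same movie listed several times.
    true_dup = repeated.index[repeated['nunique'] == 1].tolist()
    # Several distinct links: different movies sharing a title.
    false_dup = repeated.index[repeated['nunique'] > 1].tolist()
    return true_dup, false_dup

# Invented rows; the link values are placeholders, not real IMDb URLs.
toy = pd.DataFrame({
    'movie_title': ['Home', 'Home', 'Halloween', 'Halloween', 'Avatar'],
    'movie_imdb_link': ['link_a', 'link_a', 'link_b', 'link_c', 'link_d'],
})
true_dup, false_dup = split_duplicate_titles(toy)
print(true_dup, false_dup)
```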

## 5. Alter the names of duplicate titles that are different movies so each is unique. Then drop all duplicate rows based on movie title.

In [299]:
movies['movie_title'] = movies.apply(lambda x: x['movie_title'] + ' (' + str(int(x['title_year'])) + ')' if str(x['title_year']) != 'nan' and x['movie_title'] in false_dup else x['movie_title'], axis=1)

In [303]:
print(len(movies))
movies = movies.drop_duplicates('movie_title')
print(len(movies))

4998
4919


## 6. Create a series that ranks actors by proportion of movies they have appeared in

In [322]:
actors = movies.groupby(['actor_1_name'])['movie_title'].count()
actors = actors.add(movies.groupby(['actor_2_name'])['movie_title'].count(), fill_value=0)
actors = actors.add(movies.groupby(['actor_3_name'])['movie_title'].count(), fill_value=0)

(actors / len(movies)).sort_values(ascending=False).head(20)

Robert De Niro        0.010775
Morgan Freeman        0.008742
Bruce Willis          0.007725
Matt Damon            0.007522
Johnny Depp           0.007319
Steve Buscemi         0.007319
Nicolas Cage          0.006709
Brad Pitt             0.006709
Bill Murray           0.006505
Will Ferrell          0.006505
Liam Neeson           0.006505
Denzel Washington     0.006302
Anthony Hopkins       0.006099
Jim Broadbent         0.005896
J.K. Simmons          0.005896
Harrison Ford         0.005896
Robert Downey Jr.     0.005692
Tom Cruise            0.005692
Tom Hanks             0.005692
Scarlett Johansson    0.005489
Name: movie_title, dtype: float64
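The three grouped counts can also be obtained by stacking the actor columns into one Series and counting values. A minimal sketch on invented data; like the code above, it counts an actor twice if the same name appears in two columns of one movie.

```python
import pandas as pd

# Invented movies and actor names.
toy = pd.DataFrame({
    'movie_title': ['A', 'B', 'C'],
    'actor_1_name': ['Ann', 'Bob', 'Ann'],
    'actor_2_name': ['Bob', 'Cy', None],
    'actor_3_name': ['Cy', 'Ann', None],
})
actor_cols = ['actor_1_name', 'actor_2_name', 'actor_3_name']

# Put all three columns end to end, then count each name (missing names are skipped).
appearances = pd.concat([toy[c] for c in actor_cols]).value_counts()

# Share of movies each actor appears in, highest first.
proportion = (appearances / len(toy)).sort_values(ascending=False)
print(proportion)
```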

## 7. Create a table that contains the first and last years each actor appeared, and their length of history. Then include columns for each actor's proportion and total number of movies.
* length is number of years they have appeared in movies

In [353]:
# dict-based renaming in aggregate() was removed in pandas 1.0; named aggregation replaces it
actor_years = movies.groupby(['actor_1_name'])['title_year'].agg(min_year_1='min', max_year_1='max')
actor_years = actor_years.add(movies.groupby(['actor_2_name'])['title_year'].agg(min_year_2='min', max_year_2='max'), fill_value=0)
actor_years = actor_years.add(movies.groupby(['actor_3_name'])['title_year'].agg(min_year_3='min', max_year_3='max'), fill_value=0)

actor_years['first_year'] = np.min(actor_years[['min_year_1', 'min_year_2', 'min_year_3']], axis=1)
actor_years['last_year'] = np.max(actor_years[['max_year_1', 'max_year_2', 'max_year_3']], axis=1)

actor_years.drop(['min_year_1', 'min_year_2', 'min_year_3', 'max_year_1', 'max_year_2', 'max_year_3'], axis=1, inplace=True)

actor_years['history_length'] = actor_years['last_year'] - actor_years['first_year']

actor_years['movie_number'] = actors
actor_years['movie_proportion'] = actors / len(movies)

actor_years

Unnamed: 0,first_year,last_year,history_length,movie_number,movie_proportion
50 Cent,2005.0,2015.0,10.0,5.0,0.001016
A. Michael Baldwin,1988.0,1988.0,0.0,1.0,0.000203
A.J. Buckley,2000.0,2015.0,15.0,5.0,0.001016
A.J. DeLucia,2015.0,2015.0,0.0,1.0,0.000203
A.J. Langer,1998.0,1998.0,0.0,1.0,0.000203
AJ Michalka,2009.0,2011.0,2.0,2.0,0.000407
Aaliyah,2000.0,2002.0,2.0,2.0,0.000407
Aaron Ashmore,2004.0,2015.0,11.0,2.0,0.000407
Aaron Hill,,,,1.0,0.000203
Aaron Hughes,2007.0,2007.0,0.0,1.0,0.000203


## 8. Create a column that gives each movie an integer ranking based on gross sales
* 1 should indicate the highest gross
* If more than one movie has equal sales, assign all the lowest rank in the group
* The next rank after this group should increase only by 1

In [372]:
movies['gross_sales_rank'] = movies['gross'].rank(method='dense', ascending=False, na_option='bottom')
movies[['movie_title', 'gross', 'gross_sales_rank']].sort_values(by='gross_sales_rank').head(20)

Unnamed: 0,movie_title,gross,gross_sales_rank
0,Avatar,760505847.0,1.0
26,Titanic,658672302.0,2.0
29,Jurassic World,652177271.0,3.0
17,The Avengers,623279547.0,4.0
66,The Dark Knight,533316061.0,5.0
240,Star Wars: Episode I - The Phantom Menace,474544677.0,6.0
3024,Star Wars: Episode IV - A New Hope,460935665.0,7.0
8,Avengers: Age of Ultron,458991599.0,8.0
3,The Dark Knight Rises,448130642.0,9.0
582,Shrek 2,436471036.0,10.0
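The ranks above are floats, while the task asks for integers. Because `na_option='bottom'` gives every row a rank, the result can be cast with `astype(int)`. A sketch on invented values showing the tie and missing-value behaviour:

```python
import numpy as np
import pandas as pd

# Invented gross values with one tie and one missing entry.
gross = pd.Series([300.0, 500.0, 300.0, np.nan, 100.0])

# Dense ranking: ties share a rank and the next rank increases by exactly 1.
# Missing values are ranked last, so no NaN remains and the cast to int is safe.
rank = gross.rank(method='dense', ascending=False, na_option='bottom').astype(int)
print(rank)
```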
