# Solutions

1. Intro to pandas
1. [The DataFrame and Series](#2.-The-DataFrame-and-Series)
1. [Data Types and Missing Values](#3.-Data-Types-and-Missing-Values)
1. [Setting a meaningful index](#4.-Setting-a-meaningful-index)
1. Five Step Process for Data Exploration

## 2. The DataFrame and Series

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 40
bikes = pd.read_csv('../data/bikes.csv')

### Exercise 1
<span  style="color:green; font-size:16px">Select the column **`events`**, the type of weather that was recorded, and assign it to a variable with the same name. Output its first 10 values.</span>

In [2]:
events = bikes['events']
events.head(10)

0    mostlycloudy
1    partlycloudy
2    mostlycloudy
3    mostlycloudy
4    partlycloudy
5    mostlycloudy
6          cloudy
7          cloudy
8          cloudy
9    mostlycloudy
Name: events, dtype: object

### Exercise 2
<span  style="color:green; font-size:16px">What type of object is **`events`**?</span>

In [3]:
# it's a Series
type(events)

pandas.core.series.Series
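Whether a selection returns a Series or a DataFrame depends on what goes inside the brackets. The sketch below uses a small made-up DataFrame (`df` and its values are illustrative, not the bikes data) to contrast a single label with a list of labels.

```python
import pandas as pd

# Hypothetical miniature stand-in for the bikes data
df = pd.DataFrame({
    'events': ['cloudy', 'rain', 'clear'],
    'temperature': [10.9, 12.0, 15.1],
})

as_series = df['events']     # a single label selects one column as a Series
as_frame = df[['events']]    # a list of labels selects a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```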

### Exercise 3
<span  style="color:green; font-size:16px">Select the last 2 rows of the **`bikes`** DataFrame and assign them to the variable **`bikes_last_2`**. What type of object is **`bikes_last_2`**?</span>

In [4]:
# it's a DataFrame
bikes_last_2 = bikes.tail(2)
type(bikes_last_2)

pandas.core.frame.DataFrame
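`tail` returns a DataFrame no matter how many rows are requested. As a rough comparison, on a made-up DataFrame (`df` is illustrative only), selecting one row by integer position with `iloc` returns a Series instead.

```python
import pandas as pd

# Hypothetical sample; not the real bikes data
df = pd.DataFrame({'trip_id': [1, 2, 3], 'tripduration': [300, 450, 120]})

last_rows = df.tail(1)   # still a DataFrame, even with a single row
last_row = df.iloc[-1]   # a single integer position returns a Series

print(type(last_rows).__name__)
print(type(last_row).__name__)
```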

## 3. Data Types and Missing Values

### Exercise 1
<span  style="color:green; font-size:16px">What type of object is returned from the **`dtypes`** attribute?</span>

In [5]:
# a Series
type(bikes.dtypes)

pandas.core.series.Series
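Because `dtypes` is a Series, the usual Series tools apply to it: its index holds the column names and its values hold the data types. A minimal sketch on a made-up numeric DataFrame (`df` is illustrative only):

```python
import pandas as pd

# Hypothetical numeric-only sample
df = pd.DataFrame({
    'trip_id': [1, 2, 3],
    'temperature': [10.9, 12.0, 15.1],
    'tripduration': [300, 450, 120],
})

dtypes = df.dtypes
print(list(dtypes.index))       # the column names
print(dtypes['temperature'])    # look up one column's data type by label
print(dtypes.value_counts())    # how many columns there are of each data type
```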

### Exercise 2
<span  style="color:green; font-size:16px">What type of object is returned from the **`shape`** attribute?</span>

In [6]:
# a tuple: (number of rows, number of columns)
type(bikes.shape)

tuple
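Since `shape` is an ordinary tuple, it can be unpacked directly. A short sketch with a made-up DataFrame (`df`, `n_rows`, and `n_cols` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Tuple unpacking: the first item is the row count, the second the column count
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```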

### Exercise 3
<span  style="color:green; font-size:16px">What type of object is returned from the **`info`** method?</span>

The object **`None`** is returned. What you see is just output printed to the screen.

In [7]:
info_return = bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50089 entries, 0 to 50088
Data columns (total 19 columns):
trip_id              50089 non-null int64
usertype             50089 non-null object
gender               50089 non-null object
starttime            50089 non-null object
stoptime             50089 non-null object
tripduration         50089 non-null int64
from_station_name    50089 non-null object
latitude_start       50083 non-null float64
longitude_start      50083 non-null float64
dpcapacity_start     50083 non-null float64
to_station_name      50089 non-null object
latitude_end         50077 non-null float64
longitude_end        50077 non-null float64
dpcapacity_end       50077 non-null float64
temperature          50089 non-null float64
visibility           50089 non-null float64
wind_speed           50089 non-null float64
precipitation        50089 non-null float64
events               50089 non-null object
dtypes: float64(10), int64(2), object(7)
memory usage: 7.3+ MB


In [8]:
type(info_return)

NoneType
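If the summary text itself is needed, one option is the `buf` parameter of `info`, which redirects the printed output to a writable buffer. A minimal sketch on a made-up DataFrame (`df`, `buffer`, and `summary` are illustrative names):

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, None, 1.5]})

buffer = io.StringIO()
result = df.info(buf=buffer)   # the summary is written to buffer, not the screen
summary = buffer.getvalue()    # the text that would otherwise have been printed

print(result is None)
print(summary)
```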

### Exercise 4
<span  style="color:green; font-size:16px">The memory usage reported by the **`info`** method is only a lower bound when your DataFrame has object columns (note the `+` in `7.3+ MB`). Read its docstring and get the true memory usage.</span>

In [9]:
bikes.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50089 entries, 0 to 50088
Data columns (total 19 columns):
trip_id              50089 non-null int64
usertype             50089 non-null object
gender               50089 non-null object
starttime            50089 non-null object
stoptime             50089 non-null object
tripduration         50089 non-null int64
from_station_name    50089 non-null object
latitude_start       50083 non-null float64
longitude_start      50083 non-null float64
dpcapacity_start     50083 non-null float64
to_station_name      50089 non-null object
latitude_end         50077 non-null float64
longitude_end        50077 non-null float64
dpcapacity_end       50077 non-null float64
temperature          50089 non-null float64
visibility           50089 non-null float64
wind_speed           50089 non-null float64
precipitation        50089 non-null float64
events               50089 non-null object
dtypes: float64(10), int64(2), object(7)
memory usage: 28.9 MB
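To see where the extra memory comes from, the `memory_usage` method gives a per-column breakdown and accepts the same idea through `deep=True`. The sketch below uses a made-up DataFrame (`df` and its values are illustrative) with one column explicitly stored as object.

```python
import pandas as pd

# Hypothetical sample; the string column is forced to object dtype
df = pd.DataFrame({
    'station': pd.Series(['Lake Shore Dr', 'Clark St', 'State St'], dtype=object),
    'tripduration': [300, 450, 120],
})

shallow = df.memory_usage()          # object column counted as pointers only
deep = df.memory_usage(deep=True)    # also counts the Python strings themselves

print(shallow)
print(deep)
```

Only the object column grows under `deep=True`; the integer column reports the same number of bytes both ways.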


## 4. Setting a meaningful index

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be something other than movie title. Are there any other good columns to use as an index?</span>

In [10]:
movies = pd.read_csv('../data/movie.csv', index_col='director_name')
movies.head()

director_name,title,year,color,content_rating,duration,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
James Cameron,Avatar,2009.0,Color,PG-13,178.0,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Gore Verbinski,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Sam Mendes,Spectre,2015.0,Color,PG-13,148.0,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
Christopher Nolan,The Dark Knight Rises,2012.0,Color,PG-13,164.0,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Doug Walker,Star Wars: Episode VII - The Force Awakens,,,,,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1


Director name isn't unique, so several rows share the same index label. None of the other columns is a good candidate for the index either.
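One way to check whether a column would make a good index is the `is_unique` attribute, available on both Series and Index objects. A small sketch with a made-up three-row sample (`movies_small` is illustrative, not the real dataset):

```python
import pandas as pd

# Hypothetical sample in which one director appears twice
movies_small = pd.DataFrame({
    'title': ['Avatar', 'Titanic', 'Spectre'],
    'director_name': ['James Cameron', 'James Cameron', 'Sam Mendes'],
})

print(movies_small['title'].is_unique)           # True
print(movies_small['director_name'].is_unique)   # False

by_director = movies_small.set_index('director_name')
print(by_director.index.is_unique)               # False
```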

### Exercise 2
<span  style="color:green; font-size:16px">Use `set_index` to set the index and keep the column as part of the data.</span>

In [11]:
movies = pd.read_csv('../data/movie.csv')
movies = movies.set_index('title', drop=False)
movies.head(3)

title,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
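The effect of `drop=False` is easiest to see side by side with the default. A minimal sketch on a made-up two-row DataFrame (`df`, `dropped`, and `kept` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'title': ['Avatar', 'Spectre'], 'year': [2009, 2015]})

dropped = df.set_index('title')             # default: the column moves into the index
kept = df.set_index('title', drop=False)    # the column stays in the data as well

print(list(dropped.columns))   # ['year']
print(list(kept.columns))      # ['title', 'year']
```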


### Exercise 3
<span  style="color:green; font-size:16px">Assign the index of the movie DataFrame that has the titles in the index to its own variable. Output the last 10 movie titles.</span>

In [12]:
index = movies.index
index[-10:]

Index(['Primer', 'Cavite', 'El Mariachi', 'The Mongol King', 'Newlyweds',
       'Signed Sealed Delivered', 'The Following', 'A Plague So Pleasant',
       'Shanghai Calling', 'My Date with Drew'],
      dtype='object', name='title')
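An Index supports the same positional slicing as a list. As a rough illustration with a made-up Index (`index` here holds only four of the titles), a slice returns another Index while a single position returns the label itself.

```python
import pandas as pd

# Hypothetical short Index built by hand
index = pd.Index(['Primer', 'Cavite', 'El Mariachi', 'Newlyweds'], name='title')

last_two = index[-2:]   # slicing returns another Index and keeps its name
last_one = index[-1]    # a single position returns the label itself

print(last_two.tolist())
print(last_one)
```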

### Exercise 4
<span  style="color:green; font-size:16px">Use an integer instead of the column name for **`index_col`** when reading in the data using **`read_csv`**. What does it do?</span>

In [13]:
movies = pd.read_csv('../data/movie.csv', index_col=-5)
movies.head()

plot_keywords,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,language,country,budget,imdb_score
avatar|future|marine|native|paraplegic,Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,English,USA,237000000.0,7.9
goddess|marriage ceremony|marriage proposal|pirate|singapore,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,English,USA,300000000.0,7.1
bomb|espionage|sequel|spy|terrorist,Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,English,UK,245000000.0,6.8
deception|imprisonment|lawlessness|police officer|terrorist plot,The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,English,USA,250000000.0,8.5
,Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,7.1


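It uses the column at that integer position as the index. Negative integers count from the end, so `-5` selects **`plot_keywords`**, the fifth column from the last.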
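The same behavior can be tried without the data file by reading CSV text from memory. The sketch below assumes a made-up three-column CSV (`csv_text` is illustrative) and uses a non-negative position to show that the integer and the column name give the same result.

```python
import io
import pandas as pd

# Hypothetical CSV content; 'year' sits at integer position 1
csv_text = "title,year,imdb_score\nAvatar,2009,7.9\nSpectre,2015,6.8\n"

by_position = pd.read_csv(io.StringIO(csv_text), index_col=1)
by_name = pd.read_csv(io.StringIO(csv_text), index_col='year')

print(by_position.index.name)
print(by_position.equals(by_name))
```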

### Exercise 5
<span  style="color:green; font-size:16px">Use `pd.reset_option('all')` to reset the options to their default values. Test that this worked. </span>

In [14]:
pd.reset_option('all')

html.border has been deprecated, use display.html.border instead
(currently both are identical)


: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.
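To test that a reset worked, read the option back with `pd.get_option` and compare it with its default. The sketch below resets a single option rather than `'all'` to avoid the deprecation messages, and it assumes the option has not been changed earlier in the session (`default_max_columns` and `restored` are illustrative names).

```python
import pandas as pd

# Assumes a fresh session, so this is the default value
default_max_columns = pd.get_option('display.max_columns')

pd.set_option('display.max_columns', 40)
changed = pd.get_option('display.max_columns')    # now 40

pd.reset_option('display.max_columns')            # back to the default
restored = pd.get_option('display.max_columns')

print(changed)
print(restored == default_max_columns)
```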




