# Data Processing with Python and Pandas Part Two


## Today's Topics

* Review last week and subsetting data with query masks
* Data Cleaning
* Data Wrangling
* Working with Time

## A quick review of last week

* Series
* Data Frames
* Index

In [1]:
# Import pandas so we can do stuff
import pandas as pd


### Series

* One-dimensional data structure
* Mother was a list, father was a dictionary
* Dictionary keys become the Series *index*

In [2]:
# create a Series from a list with implicit index
my_list = [0.25, 0.5, 0.75, 1.0]
data = pd.Series(my_list)
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [3]:
# create a Series from a list with explicit index
my_list = [0.25, 0.5, 0.75, 1.0]
data = pd.Series(my_list, index=[1,2,3,"Picksburgh"])
data

1             0.25
2             0.50
3             0.75
Picksburgh    1.00
dtype: float64

* You can create an explicit index that starts at one, but positional slicing is still zero-based
* That is why we always use `loc` (look up by index label) and `iloc` (look up by position)

In [4]:
# get the item at index location 1 (the second position)
data.iloc[1]

0.5

In [5]:
# get the item with the index name
data.loc[1]

0.25

In [6]:
# get the items at the 2nd and 3rd locations
data.iloc[1:3]

2    0.50
3    0.75
dtype: float64
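One more difference is worth seeing side by side: `iloc` slices leave out the stop position, while `loc` slices include the stop label. The sketch below assumes an all-integer, one-based index (not the mixed index used above) so that the same slice `1:3` can be passed to both.

```python
import pandas as pd

# Same values as above, but with an all-integer one-based index (an assumption
# made here so the same slice works for both loc and iloc)
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 2, 3, 4])

# iloc slices by position and excludes the stop position
by_position = data.iloc[1:3]   # positions 1 and 2

# loc slices by label and includes the stop label
by_label = data.loc[1:3]       # labels 1, 2 and 3

print(by_position.tolist())    # [0.5, 0.75]
print(by_label.tolist())       # [0.25, 0.5, 0.75]
```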

* Series from python dictionaries

In [7]:
# create a Series from a dictionary where keys become the index
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [8]:
# you can't slice a dictionary
population_dict['California':'Illinois']

TypeError: unhashable type: 'slice'

In [9]:
# but you can slice a Series
population.loc['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [10]:
population.loc['California']

38332521

* Series has a bunch of methods for manipulating data.
* [See the documentation for a list](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.html)

In [11]:
sorted_population = population.sort_values()
sorted_population

Illinois      12882135
Florida       19552860
New York      19651127
Texas         26448193
California    38332521
dtype: int64

In [12]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [13]:
population.sort_index()

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64
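The cells above suggest that sorting returns a new Series and leaves the original alone. Here is a small sketch of that, using three of the states in a deliberately unsorted order and the `ascending=False` option to put the largest value first.

```python
import pandas as pd

# Three of the states from above, entered in no particular order
population = pd.Series({'Illinois': 12882135,
                        'California': 38332521,
                        'Florida': 19552860})

# sort_values returns a new Series; ascending=False puts the largest first
largest_first = population.sort_values(ascending=False)

print(list(largest_first.index))  # ['California', 'Florida', 'Illinois']
print(list(population.index))     # original order is unchanged
```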

### Dataframes

* Two-dimensional data structure
* Made of columns, where each column is a Series
* A spreadsheet, but in Python 

In [14]:
# Quickly create two series with the same index, but different values 
population = pd.Series({'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135})
area = pd.Series({'Illinois': 149995, 'California': 423967, 
             'Texas': 695662, 'Florida': 170312, 
             'New York': 141297})

# now moosh them together into a dataframe
states = pd.DataFrame({'population': population,
                       'area': area})
states

Unnamed: 0,population,area
California,38332521,423967
Florida,19552860,170312
Illinois,12882135,149995
New York,19651127,141297
Texas,26448193,695662
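Notice that the two Series listed the states in different orders, yet each row ended up with the right values, because pandas lines the values up by index label. A minimal sketch of that alignment, plus a derived column (the name `density` is just an illustration, not part of the original notebook):

```python
import pandas as pd

# Two Series with the same labels in a different order
population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'California': 423967})

# Values are matched by index label, not by position
states = pd.DataFrame({'population': population, 'area': area})

# Column arithmetic is also row-by-row; 'density' is a hypothetical column name
states['density'] = states['population'] / states['area']
print(states)
```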


* Reading CSV files into Dataframes


In [15]:
# read the tab-separated data into a pandas dataframe (pandas adds a default integer index)
order_data  = pd.read_csv("../4 - data management one/chipotle.tsv", sep="\t")
# inspect the dataframe
order_data.head() 

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,2.39
1,1,1,Izze,[Clementine],3.39
2,1,1,Nantucket Nectar,[Apple],3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",16.98
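If you want to try `read_csv` without the file on disk, the sketch below reads a couple of illustrative tab-separated rows from an in-memory string. It also shows the `index_col` parameter, which uses an existing column as the index instead of the default 0, 1, 2, ... numbering.

```python
import io
import pandas as pd

# Two illustrative rows in the style of the file above (not the real file)
tsv_text = (
    "order_id\tquantity\titem_name\titem_price\n"
    "1\t1\tIzze\t3.39\n"
    "2\t2\tChicken Bowl\t16.98\n"
)

# sep="\t" tells pandas the columns are separated by tabs, not commas
orders = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# index_col picks a column to use as the index
by_order = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col="order_id")

print(orders.index.tolist())    # [0, 1]
print(by_order.index.tolist())  # [1, 2]
```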


* Writing a Dataframe to a CSV

In [16]:
# Write to a file in current working directory 
# Don't include the row index in output file
order_data.to_csv("chipotle.csv", index=False)
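To see what `index=False` changes, here is a small sketch on a made-up two-row dataframe. Calling `to_csv` without a file path returns the CSV text as a string, which makes the header line easy to compare.

```python
import pandas as pd

# A made-up two-row dataframe, just for illustration
df = pd.DataFrame({'item_name': ['Izze', 'Side of Chips'],
                   'item_price': [3.39, 1.69]})

# With no path given, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The header line starts with an empty field when the index is included
print(with_index.splitlines()[0])     # ,item_name,item_price
print(without_index.splitlines()[0])  # item_name,item_price
```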

## Subsetting Data

* It is sometimes helpful to think of a Pandas Dataframe as a little database. 
* There is data and information stored in the Pandas Dataframe (or Series) and you want to *retrieve* it.
* Pandas has multiple mechanisms for getting specific bits of data and information from its data structures. 

### Masking: Filtering by Values

* The most common is to use *masking* to select just the rows you want. 
* Masking is a two-stage process: first you create a sequence of boolean values based on a conditional expression (which you can think of as a "query"), and then you index your dataframe using that boolean sequence.
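The same two stages work on a Series. A minimal sketch using the state populations from the review (the 20 million cutoff and the name `big_states` are arbitrary choices for illustration):

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Stage 1: the conditional expression produces one boolean per entry
mask = population > 20_000_000

# Stage 2: indexing with the mask keeps only the entries marked True
big_states = population[mask]

print(mask.tolist())             # [True, True, False, False, False]
print(list(big_states.index))    # ['California', 'Texas']
```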

In [17]:
# Let's look at the chipotle order data
order_data.head(10)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,2.39
1,1,1,Izze,[Clementine],3.39
2,1,1,Nantucket Nectar,[Apple],3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",10.98
6,3,1,Side of Chips,,1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",9.25


In [18]:
# Let's look at all the columns
order_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
order_id              4622 non-null int64
quantity              4622 non-null int64
item_name             4622 non-null object
choice_description    3376 non-null object
item_price            4622 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 180.6+ KB


* How might we only look at particular orders?
* The first step is to create a *query mask*, a Series of `True`/`False` values marking which rows satisfy a particular condition.

In [19]:
# create a query mask for chicken bowls
query_mask = order_data['item_name'] == "Chicken Bowl"

# look at the first 20 items to see what matches
query_mask.head(20)

0     False
1     False
2     False
3     False
4      True
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14    False
15    False
16    False
17    False
18    False
19     True
Name: item_name, dtype: bool

* This shows each row's index and `True` or `False` depending on whether the item name equals "Chicken Bowl"
* We can look up that row by index and see if it is correct

In [20]:
order_data.iloc[19]

order_id                                                             10
quantity                                                              1
item_name                                                  Chicken Bowl
choice_description    [Tomatillo Red Chili Salsa, [Fajita Vegetables...
item_price                                                         8.75
Name: 19, dtype: object

* Yup! So now that we know the mask works, we can create a *subset* of our data containing chicken bowls.

In [21]:
chicken_bowls = order_data[query_mask]
chicken_bowls.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",10.98
13,7,1,Chicken Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Rice,...",11.25
19,10,1,Chicken Bowl,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",8.75
26,13,1,Chicken Bowl,"[Roasted Chili Corn Salsa (Medium), [Pinto Bea...",8.49


* Now you can do things like calculate the average price for chicken bowl orders

In [22]:
# Calculate the mean price for chicken bowls
chicken_bowls['item_price'].mean()

10.113953168044079

In [23]:
# See how many chicken bowls people order
chicken_bowls['quantity'].value_counts()

1    693
2     31
3      2
Name: quantity, dtype: int64
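A mask can also be summarized directly, without building the subset first. Since `True` counts as 1 and `False` as 0, `sum()` gives the number of matching rows and `mean()` gives the share of rows that match. A sketch on a made-up four-row dataframe:

```python
import pandas as pd

# Made-up rows for illustration
orders = pd.DataFrame({'item_name': ['Izze', 'Chicken Bowl',
                                     'Side of Chips', 'Chicken Bowl'],
                       'quantity': [1, 2, 1, 1]})

mask = orders['item_name'] == "Chicken Bowl"

# True counts as 1, so the sum is the number of matching rows
n_matches = int(mask.sum())

# ...and the mean is the fraction of rows that match
share = float(mask.mean())

print(n_matches)  # 2
print(share)      # 0.5
```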

* We can also combine query masks using boolean logic
* Can we look at just the chicken bowl orders that were less than $10?

In [24]:
# create a query mask for chicken bowls
item_query_mask = order_data['item_name'] == "Chicken Bowl"
# create a query mask for cheap orders
price_query_mask = order_data['item_price'] < 10

# apply both query masks using boolean AND
cheap_chicken_bowls = order_data[item_query_mask & price_query_mask]
cheap_chicken_bowls.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
19,10,1,Chicken Bowl,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",8.75
26,13,1,Chicken Bowl,"[Roasted Chili Corn Salsa (Medium), [Pinto Bea...",8.49
76,34,1,Chicken Bowl,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",8.75
78,34,1,Chicken Bowl,"[Fresh Tomato Salsa, [Rice, Black Beans, Chees...",8.75
99,44,1,Chicken Bowl,"[Tomatillo Red Chili Salsa, [Rice, Fajita Vege...",8.75


In [25]:
# Median price for cheap chicken bowls
cheap_chicken_bowls['item_price'].median()

8.75
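Besides AND (`&`), masks can be combined with OR (`|`) and flipped with NOT (`~`). If you write the conditions inline instead of storing each mask in a variable, wrap each comparison in parentheses, because `&` and `|` bind more tightly than `==` and `<` in Python. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration
orders = pd.DataFrame({'item_name': ['Izze', 'Chicken Bowl',
                                     'Chicken Bowl', 'Steak Burrito'],
                       'item_price': [3.39, 8.75, 10.98, 11.75]})

# AND: both conditions must hold (note the parentheses around each comparison)
cheap_bowls = orders[(orders['item_name'] == "Chicken Bowl")
                     & (orders['item_price'] < 10)]

# OR: either condition may hold
bowl_or_cheap = orders[(orders['item_name'] == "Chicken Bowl")
                       | (orders['item_price'] < 10)]

# NOT: flip every True/False in the mask
not_bowls = orders[~(orders['item_name'] == "Chicken Bowl")]

print(cheap_bowls.index.tolist())    # [1]
print(bowl_or_cheap.index.tolist())  # [0, 1, 2]
print(not_bowls.index.tolist())      # [0, 3]
```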

* Query masks can be used to filter and create subsets of data
* Note, pandas cannot always tell whether a subset made this way is a "view" onto the original dataframe or a separate copy of the data
* Because of that ambiguity, an assignment to the subset might or might not change the original dataframe
* This means if you try to do transformations on the subset, you will get a `SettingWithCopyWarning`; making an explicit copy with `.copy()` avoids it
* For more information, [see the pandas documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy)

In [26]:
cheap_chicken_bowls['half_price'] = cheap_chicken_bowls['item_price'] / 2

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [27]:
copy_of_cheap_chicken_bowls = cheap_chicken_bowls.copy()
copy_of_cheap_chicken_bowls['half_price'] = copy_of_cheap_chicken_bowls['item_price'] / 2
copy_of_cheap_chicken_bowls.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,half_price
19,10,1,Chicken Bowl,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",8.75,4.375
26,13,1,Chicken Bowl,"[Roasted Chili Corn Salsa (Medium), [Pinto Bea...",8.49,4.245
76,34,1,Chicken Bowl,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",8.75,4.375
78,34,1,Chicken Bowl,"[Fresh Tomato Salsa, [Rice, Black Beans, Chees...",8.75,4.375
99,44,1,Chicken Bowl,"[Tomatillo Red Chili Salsa, [Rice, Fajita Vege...",8.75,4.375
