# Welcome to data exploration with Python and Jupyter!

## Here's what to expect:

### Morning

1. [Overview of Jupyter Notebooks](#1.-Overview-of-Jupyter)
2. [Overview of Python syntax with calculation](#2.-Overview-of-Python-syntax)
3. [Refactoring our calculation using Python data structures](#3.-Overview-of-Python-data-structures)

### After First Break

4. [Introducing pandas and cleaning text data](#4.-Introducing-pandas-and-text-cleaning)
5. [Exploration: movie quotes](#5.-Exploration:-movie-quotes)
6. [Introduction to city population data](#6.-Introduction-to-city-population-data)

### Afternoon/After Lunch

7. [City population data prep](#7.-City-population-data-prep)

### After Second Break

8. [Exploration: city population data](#8.-Exploration:-city-population-data)

### Generally:

# <center>CAT GIFS</center>

![yay](https://78.media.tumblr.com/42a41bd1ace113eb410b0005192c2275/tumblr_p8xzms08OS1qhy6c9o1_500.gif)

### Note that this is the reference notebook, so code is already largely entered. It'll be helpful for following along or referring to when you're lost!

# 0. Imports

In [None]:
import pandas as pd
import requests

# 1. Overview of Jupyter

In [None]:
%pwd

In [None]:
%quickref

In [None]:
%who int

# 2. Overview of Python syntax

## How much money does making coffee at home save?

Simple calculation

In [None]:
# multiply coffee and tip to get cost

2 * 1.2

In [None]:
# name variables

latte = 4
drip_coffee = 2
tip = .2

How do the coffee and tip variables differ?

In [None]:
# Check `type`

print(type(latte))
print(type(tip))
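One way to see why the difference matters: combining an `int` with a `float` gives you back a `float`. This small sketch reuses the `latte` and `tip` values from above; `total` is a name introduced here only for illustration.

```python
latte = 4      # whole number, so Python stores it as an int
tip = .2       # has a decimal point, so Python stores it as a float

# Multiplying an int by a float produces a float
total = latte * (1 + tip)

print(type(total))
print(round(total, 2))
```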

# 3. Overview of Python data structures

Python's built-in data structures include lists, dictionaries, tuples, and sets. We only need the first two today.

#### Create a `list` of coffee preparations that you might order.

In [None]:
coffee_preps = ['latte', 'espresso', 'drip_coffee']

#### Use a `dict` to allow comparison of multiple coffee types 

In [None]:
price_lookup = {'latte': 5, 'espresso': 3, 'drip_coffee': 2}

In [None]:
price_lookup['latte']
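The list and the dict work well together: you can loop over the list and use each item as a key into the dict. Here is a minimal sketch, assuming the same `tip` of `.2` from section 2; `costs_with_tip` is a hypothetical name used just for this example.

```python
coffee_preps = ['latte', 'espresso', 'drip_coffee']
price_lookup = {'latte': 5, 'espresso': 3, 'drip_coffee': 2}
tip = .2

# Build a new dict mapping each preparation to its price including tip
costs_with_tip = {}
for prep in coffee_preps:
    costs_with_tip[prep] = price_lookup[prep] * (1 + tip)

for prep, cost in costs_with_tip.items():
    print(f"{prep}: ${cost:.2f}")
```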

How do we figure out the amount of money saved per coffee made at home? For simplicity, let's just do this calculation with drip coffee.

Related figures:

- 1 oz. = 28.3495 grams
- A cup of French press coffee uses about 15 grams of coffee grounds

#### Assign the appropriate values to these variables

In [None]:
gram = 1
ounce = gram * 28.3495
serving = gram * 15
dollars_per_ounce = 0.75

#### Construct a calculation for comparing these:

In [None]:
home_coffee = (serving/ounce) * dollars_per_ounce

In [None]:
home_coffee

In [None]:
# ratio: how many times more a cafe drip coffee costs than a cup made at home
drip_coffee/home_coffee
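Dividing tells us how many times more the cafe cup costs. To get the amount saved per cup, subtract instead. The sketch below repeats the values assigned above so it runs on its own; `savings_per_cup` and `cups_per_year` are names introduced here, and one cup per day is purely an assumption.

```python
drip_coffee = 2             # cafe price in dollars, from section 2
ounce = 28.3495             # grams per ounce
serving = 15                # grams of grounds per cup
dollars_per_ounce = 0.75    # price of grounds, as assigned above

home_coffee = (serving / ounce) * dollars_per_ounce

# The saving is a difference, not a ratio
savings_per_cup = drip_coffee - home_coffee

cups_per_year = 365         # assumption: one cup every day
print(f"Saved per cup: ${savings_per_cup:.2f}")
print(f"Saved per year: ${savings_per_cup * cups_per_year:.2f}")
```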

# First Break

![kites](https://78.media.tumblr.com/2c05bccd7e1c9353966e4095dc5caf1b/tumblr_ootmmjLVYj1qhy6c9o1_500.gif)

# 4. Introducing pandas and text cleaning

In [None]:
quotes = pd.read_html('https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movie_Quotes', header=0)

What did we get back?

In [None]:
# Check type
type(quotes)

Given the data type, how do we check for the information we want?

In [None]:
quotes[2]
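`pd.read_html` returns a list with one DataFrame per table it finds on the page, so one way to find the right index is to loop over the list and look at each table's shape and column names. The sketch below uses a small stand-in list (`tables`, with made-up contents) in place of `quotes` so it runs without a network connection.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html (contents are made up)
tables = [
    pd.DataFrame({'Note': ['navigation box']}),
    pd.DataFrame({'Rank': [1, 2],
                  'Quotation': ['first quote', 'second quote'],
                  'Film': ['Film A', 'Film B']}),
]

# Print each table's position, shape, and first few column names
for i, table in enumerate(tables):
    print(i, table.shape, list(table.columns)[:4])

# Keep the positions of tables that have the column we care about
matches = [i for i, table in enumerate(tables) if 'Quotation' in table.columns]
print(matches)
```

With the real `quotes` list, the same loop should make it easier to confirm which index holds the table of quotes.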

In [None]:
df = quotes[2]

How do we know our data is as expected?

In [None]:
df.info()

In [None]:
df.head()

# 5. Exploration: movie quotes

How often does Brando appear?

In [None]:
df.loc[df['Actor/Actress'].str.contains("Brando")]
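The filter above shows the matching rows. To answer "how often" with a number, you can sum the boolean mask directly. This is a rough sketch on a hand-typed stand-in frame (`sample_df` and its rows are made up for illustration, not taken from the Wikipedia table).

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Actor/Actress': ['Marlon Brando', 'Someone Else', 'Marlon Brando', None],
    'Film': ['Film A', 'Film B', 'Film C', 'Film D'],
})

# str.contains returns a boolean Series; na=False treats missing names as non-matches
mask = sample_df['Actor/Actress'].str.contains("Brando", na=False)

# True counts as 1, so summing the mask counts the matching rows
brando_count = mask.sum()
print(brando_count)
```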

What year had the greatest number of quotes on the list?

In [None]:
# Hint: Start with df.groupby() and use shift + tab to look at its parameters

# df.groupby('Year').count().sort_values(by='Film', ascending=False)

grouped = df.groupby('Year')

In [None]:
grouped

What is the GroupBy object?

You can read more in its documentation [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-object-attributes).

But you can also explore!

In [None]:
# Type `grouped.` on a new line, then press tab to explore the object's attributes and methods

# grouped.

What does it look like to find number of films per year?

In [None]:
grouped.count()

This is a start, but how do we sort what we've found?

In [None]:
grouped_by_count = grouped.count()

Find your method:

In [None]:
grouped_by_count.sort_values(by='Film')

This is useful! But it'd be more useful if it were sorted in descending order.

How do we figure out whether this is an option for the `sort_values` function?

In [None]:
grouped_by_count.sort_values(by='Film', )
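Pressing shift + tab inside the parentheses shows that `sort_values` accepts an `ascending` parameter, which defaults to `True`. Here is a minimal sketch of the descending sort on a made-up stand-in frame (`sample_df` and `most_first` are names used only in this example).

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Year': [1939, 1942, 1939, 1972, 1939, 1942],
    'Film': ['A', 'B', 'C', 'D', 'E', 'F'],
})

grouped_by_count = sample_df.groupby('Year').count()

# ascending=False puts the largest counts first
most_first = grouped_by_count.sort_values(by='Film', ascending=False)
print(most_first)
```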

# 6. Introduction to city population data

In [None]:
page = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population', header=0)

How do we find the data we're looking for?

In [None]:
city_df = page[4]

In [None]:
city_df.head()

Before we write any code here, let's talk about what this data can tell us -- and what it cannot.

# Lunch

![lunch](https://78.media.tumblr.com/548cfa1622853c30207245c3214df098/tumblr_oq7r3quYsh1qhy6c9o1_500.gif)

# 7. City population data prep

Let's reacquaint ourselves with the data.

In [None]:
city_df.head()

These column headers aren't as useful as they could be. How do we fix that?

In [None]:
city_df.rename?

#### We need a dictionary. How did we make one before?

In [None]:
useful_dict = {'foo': 5, 'bar': 'baz'}

This would take a lot of typing! Instead, you can make one by combining two lists.

#### What lists do you need?

In [None]:
current_columns = list(city_df.columns)

In [None]:
current_columns

In [None]:
new_columns = ['2017rank',
               'City',
               'State',
               '2017estimate',
               '2010Census',
               'Change',
               '2016 land area (sq mi)',
               '2016 land area (sq km)',
               '2016 population density (per sq mi)',
               '2016 population density (per sq km)',
               'Location']

In [None]:
column_dict = dict(zip(current_columns, new_columns))
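If `dict(zip(...))` looks like magic, here it is on a tiny pair of lists. `zip` pairs items up by position and `dict` turns those pairs into old-to-new lookups. The header names in `current` below are hypothetical stand-ins, not the real column names.

```python
current = ['2017 rank', 'City', 'State[c]']   # hypothetical existing headers
new = ['2017rank', 'City', 'State']

# zip pairs items by position; dict turns the pairs into key -> value lookups
mapping = dict(zip(current, new))
print(mapping)

# zip stops at the shorter list, so check the lengths to avoid silently dropping columns
assert len(current) == len(new)
```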

In [None]:
city_df.rename(columns=column_dict, inplace=True)

In [None]:
city_df.head()

# Break!

![break2](https://78.media.tumblr.com/ff9287842f26cd9993b8151736992244/tumblr_osj6r9gFH21qhy6c9o1_500.gif)

In [None]:
city_df.head()

# 8. Exploration: city population data

We'll figure out what questions we want to ask in the workshop, and we may not go in this direction. But here's a question you can pursue:

### What was the percent change in population *density* for each city between 2010 and 2016?

Optional: Remove footnote markup from city names

In [None]:
# raw string (r"...") keeps the backslashes intact for the regular expression
city_df['City'] = city_df['City'].str.replace(r"\[\d+\]", "", regex=True)
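To see what that pattern does and does not match, you can try it on plain strings with Python's built-in `re` module. The city strings below are typed in by hand for illustration. Note that the pattern only matches digits inside brackets, so a lettered footnote is left alone.

```python
import re

pattern = r"\[\d+\]"   # a literal "[", one or more digits, then a literal "]"

cleaned_numeric = re.sub(pattern, "", "Phoenix[12]")
cleaned_letter = re.sub(pattern, "", "New York[d]")

print(cleaned_numeric)
print(cleaned_letter)
```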

Optional: Drop columns you don't plan to use

In [None]:
# city_df.drop(['2017rank','2017estimate'], axis=1)
city_df

#### Task: Make land area fields usable in calculations

#### Task: Calculate 2010 population density

#### Task: Calculate difference
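If you get stuck, here is one possible sketch of the three tasks. It runs on made-up rows rather than `city_df`, the column names and the `to_number` helper are hypothetical, and it assumes each city's land area did not change between 2010 and 2016. Treat it as a starting point, not the answer.

```python
import pandas as pd

# Made-up rows standing in for city_df; column names here are hypothetical
sample = pd.DataFrame({
    'City': ['Exampleville', 'Sampleton'],
    '2010Census': [1000000, 500000],
    'land_area_sq_mi': ['250.0 sq mi', '100.0 sq mi'],
    'density_2016_sq_mi': ['4,400/sq mi', '5,250/sq mi'],
})

def to_number(series):
    """Pull the first number out of each string and convert it to float."""
    digits = series.str.extract(r'([\d,.]+)')[0]
    return digits.str.replace(',', '', regex=False).astype(float)

# Task: make the text fields usable in calculations
area = to_number(sample['land_area_sq_mi'])
density_2016 = to_number(sample['density_2016_sq_mi'])

# Task: 2010 density (assumption: land area was the same in 2010 as in 2016)
density_2010 = sample['2010Census'] / area

# Task: percent change between the two densities
sample['pct_change'] = (density_2016 - density_2010) / density_2010 * 100
print(sample[['City', 'pct_change']])
```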

Bonus: how do these cities' population densities compare to those of cities elsewhere in the world?

https://en.wikipedia.org/wiki/List_of_cities_proper_by_population

# <center> 💖 Fin! 💖</center>

![fin](https://78.media.tumblr.com/7381533edc6941a7ec0d98c71303e3cd/tumblr_p2bac8rqVI1qhy6c9o1_500.gif)