# Introduction to Python for Social Science
- UMN LATIS workshop, Oct 16, 2020
- Michael Beckstrand (mjbeckst@umn.edu) and Cody Hennesy (chennesy@umn.edu)

* Use Python 3 in a JupyterLab computing environment
* Use Python to grab data from a large number of files quickly
* Load a comma-delimited spreadsheet (.csv) into Pandas as a dataframe
* View and clean that data
* Save cleaned data file in formats for later use

### Jupyter lab environment
- How to navigate the directory of folders
- How to create a cell
- How to run a cell

### Variables
- To "run" a Jupyter cell, hold down Shift and press Return/Enter, or choose the "play" icon (right-facing triangle) from the Jupyter menu above.

In Python, variable names:

* can include letters, digits, and underscores
* cannot start with a digit
* are case sensitive.
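
As a quick check of the naming rules above, here is a small sketch using Python's built-in `str.isidentifier()`. The candidate names are made up for illustration, and `isidentifier()` only checks the spelling rules (it does not flag reserved keywords such as `for`).

```python
# Candidate names, chosen only for illustration
candidates = ["weight_1", "1_weight", "weight-1", "Weight_1"]

for name in candidates:
    # str.isidentifier() is True when the string follows the naming rules
    print(name, name.isidentifier())

# Names are case sensitive: these are two separate variables
weight = 140
Weight = 60
print(weight + Weight)
```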

You can do calculations and save text strings in variables too.

In [27]:
weight_1 = 140
print(weight_1)

140


In [28]:
weight_2 = 60
print(weight_1 + weight_2)

200


In [29]:
sentence = "This is a string of text."
print(sentence)

This is a string of text.


### Types
Everything in Python is some type of object. Objects have attributes: typically data, plus related functions called methods.

In [30]:
print(type(weight_1))
print(type(sentence))
print(type(140.0))
print(type(print))

<class 'int'>
<class 'str'>
<class 'float'>
<class 'builtin_function_or_method'>
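
To make the idea of methods concrete, here is a minimal sketch that calls a few built-in methods on a string and an integer. The variable names `shouted`, `word_count`, and `bits` are introduced here only for illustration.

```python
sentence = "This is a string of text."
weight_1 = 140

# Methods are functions attached to an object; call them with a dot
shouted = sentence.upper()          # str method: returns an upper-case copy
word_count = len(sentence.split())  # str.split() breaks the text on whitespace
bits = weight_1.bit_length()        # int method: bits needed to represent 140

print(shouted)
print(word_count)
print(bits)
```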


### Lists

The most popular kind of data collection in Python is the list, which takes the place of arrays in programming languages like C and Fortran.
Lists have two important characteristics:
1. They are mutable, i.e., they can be changed after they are created.
2. They are heterogeneous, i.e., they can store values of many different types.

To create a new list, you can just put some values in square brackets with commas in between.

In [31]:
my_list = ['winter', 'sprung', 'summer', 'fall'] # this list has a typo
print(my_list)

['winter', 'sprung', 'summer', 'fall']


Note that we used a # symbol to leave a comment in the code above. Anything that follows the # symbol on the same line will be "commented out" of the code, so Python will not try to interpret it. The # symbol can be used to create a comment at the beginning of a new line too.

In [32]:
# let's check the type of the list we just created
type(my_list)

list

To fetch the element at a specific location, put the *index* of that location in square brackets. But keep in mind that Python lists start the index from 0. So the four-item list above has four index values: `my_list[0]`, `my_list[1]`, `my_list[2]`, and `my_list[3]`.

In [33]:
print(my_list[1])

sprung


In [34]:
my_list[1] = 'spring'
print(my_list[1])

spring
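
A short sketch of a few related list tools, assuming the corrected four-season list: `len()` reports how many items there are, negative indexes count from the end, and asking for an index past the end raises an `IndexError`. The name `last_valid_index` is illustrative only.

```python
my_list = ['winter', 'spring', 'summer', 'fall']

print(len(my_list))   # number of items in the list
print(my_list[-1])    # negative indexes count from the end
last_valid_index = len(my_list) - 1

try:
    my_list[4]        # one past the last valid index
except IndexError as err:
    print("IndexError:", err)
```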


Lists can contain different types of Python objects, and you can even create lists of lists.

In [35]:
mixed_list = ['word', 3, 10.2, ['list', 'of', 'items']]
print(mixed_list[3])
print(type(mixed_list[3]))

['list', 'of', 'items']
<class 'list'>


Using syntax similar to a list index, you can look at slices of a list. A slice `[start:stop]` includes the item at `start` and stops just before the item at `stop`:

In [36]:
mixed_list[1:3]

[3, 10.2]

In [37]:
mixed_list[2:]

[10.2, ['list', 'of', 'items']]

In [38]:
mixed_list[:2]

['word', 3]

Sometimes you'll want to create an empty list (which you can fill later). You can do so by declaring a list variable with empty square brackets.

In [39]:
future_list = []
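
One way to fill an empty list later is with the list methods `append()` and `extend()`. This is a minimal sketch; the season values are placeholders.

```python
future_list = []

future_list.append('winter')            # append() adds one item to the end
future_list.append('spring')
future_list.extend(['summer', 'fall'])  # extend() adds every item from another list

print(future_list)
```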

Now let's create a list of four numbers and assign it to the variable 'a'.

In [40]:
a = [1, 2, 3, 4]

Create another variable 'b' that references the variable you created in the previous step.

In [41]:
b = a

Now change the first item in the list that 'b' references.

In [42]:
b[0] = 7

Look at a and b. Are they the same or did the change you made to 'b' in the previous step also change 'a'?

In [43]:
a

[7, 2, 3, 4]

In [44]:
b

[7, 2, 3, 4]

In [45]:
b = a.copy()

In [46]:
b[0] = 1

In [47]:
a

[7, 2, 3, 4]

In [48]:
b

[1, 2, 3, 4]
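
The cells above show that `b = a` gives the same list a second name, while `a.copy()` makes a separate list. A small sketch using the `is` operator (which tests whether two names point at the same object) makes the difference visible. The variable `c` is introduced here only for illustration.

```python
a = [1, 2, 3, 4]
b = a            # b is a second name for the same list object
c = a.copy()     # c is a new list with the same items

print(b is a)    # same object?
print(c is a)    # different object, even though the contents match
print(c == a)    # == compares contents, not identity

b[0] = 7         # changing the list through b also changes what a sees
print(a[0], c[0])
```

Note that `.copy()` is a shallow copy: if the list contains other lists, those inner lists are still shared.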

## Importing Libraries (aka packages)
Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs - so we only import what we need for each program.


Python has a huge number of built-in packages, its standard library. 
Some functions are always available; others we must import.

In [49]:
rand = random.randint(1,10) # in a fresh session this raises a NameError, because random has not been imported yet

In [50]:
import random

In [51]:
rand = random.randint(1,10)
print(rand)

1
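
There are a few common ways to write an import. The sketch below shows three of them using the standard library's `random` module; the alias `rnd` and the seed value 42 are arbitrary choices for illustration. Setting a seed makes the "random" sequence repeatable, which can help when you want reproducible results.

```python
import random                      # import the whole module
from random import randint         # import one function by name
import random as rnd               # import the module under a shorter alias

random.seed(42)                    # fixing the seed makes the sequence repeatable
first = random.randint(1, 10)

random.seed(42)
second = randint(1, 10)            # same function, called without the module prefix

print(first == second)
print(rnd is random)               # the alias refers to the same module
```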


### Loading tabular data (such as a CSV) into Pandas
- Use the Pandas library to do statistics on tabular data.
- Pandas is a widely-used Python library for statistics, particularly on tabular data.
- Borrows many features from R’s dataframes.
 - A 2-dimensional table whose columns have names and potentially have different data types.
- Load it with `import pandas as pd`. The alias `pd` is commonly used for Pandas.

In [52]:
import pandas as pd

Read a Comma Separated Values (CSV) data file with `pd.read_csv`.
- Argument is the name of the file to be read.
- Assign result to a variable to store the data that was read.

In [53]:
df = pd.read_csv("bechdel_data.csv")
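
If you don't have the workshop file handy, `pd.read_csv` also accepts a file-like object, so you can try it on a small piece of text. This is a sketch only: the three rows below are a hand-copied subset of the workshop columns, and the names `csv_text` and `sample` are illustrative.

```python
import io
import pandas as pd

# A three-row sample using a few of the workshop columns
csv_text = """year,title,binary,budget
2013,21 & Over,FAIL,13000000
2012,Dredd 3D,PASS,45000000
2013,12 Years a Slave,FAIL,20000000
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)   # (rows, columns)
print(sample)
```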

In [54]:
#this is not helpful
print(df)

      year                                 imdb  \
0     2013  http://www.imdb.com/title/tt1711425   
1     2012  http://www.imdb.com/title/tt1343727   
2     2013  http://www.imdb.com/title/tt2024544   
3     2013  http://www.imdb.com/title/tt1272878   
4     2013  http://www.imdb.com/title/tt0453562   
5     2013  http://www.imdb.com/title/tt1335975   
6     2013  http://www.imdb.com/title/tt1606378   
7     2013  http://www.imdb.com/title/tt2194499   
8     2013  http://www.imdb.com/title/tt1814621   
9     2013  http://www.imdb.com/title/tt1815862   
10    2013  http://www.imdb.com/title/tt1800241   
11    2013  http://www.imdb.com/title/tt1322269   
12    2013  http://www.imdb.com/title/tt1559547   
13    2013  http://www.imdb.com/title/tt2334873   
14    2013  http://www.imdb.com/title/tt1535109   
15    2013  http://www.imdb.com/title/tt1939659   
16    2013  http://www.imdb.com/title/tt1985966   
17    2013  http://www.imdb.com/title/tt1690953   
18    2013  http://www.imdb.com

### Examining your data
After reading in the data, do a quick check to see what it looks like. Start by looking at the first 5 lines of the data frame.

In [55]:
df.head(5)

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013,domgross_2013,intgross_2013
0,2013,http://www.imdb.com/title/tt1711425,21 & Over,notalk,notalk,FAIL,13000000,25682380.0,42195766.0,2013FAIL,13000000,25682380.0,42195766.0
1,2012,http://www.imdb.com/title/tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000,13414714.0,40868994.0,2012PASS,45658735,13611086.0,41467257.0
2,2013,http://www.imdb.com/title/tt2024544,12 Years a Slave,notalk-disagree,notalk,FAIL,20000000,53107035.0,158607035.0,2013FAIL,20000000,53107035.0,158607035.0
3,2013,http://www.imdb.com/title/tt1272878,2 Guns,notalk,notalk,FAIL,61000000,75612460.0,132493015.0,2013FAIL,61000000,75612460.0,132493015.0
4,2013,http://www.imdb.com/title/tt0453562,42,men,men,FAIL,40000000,95020213.0,95020213.0,2013FAIL,40000000,95020213.0,95020213.0


Now look at the last 5 lines.

In [56]:
df.tail(5)

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013,domgross_2013,intgross_2013
1789,1971,http://www.imdb.com/title/tt0067741,Shaft,notalk,notalk,FAIL,53012938,70327868.0,107190108.0,1971FAIL,305063707,404702718.0,616827003.0
1790,1971,http://www.imdb.com/title/tt0067800,Straw Dogs,notalk,notalk,FAIL,25000000,10324441.0,11253821.0,1971FAIL,143862856,59412143.0,64760273.0
1791,1971,http://www.imdb.com/title/tt0067116,The French Connection,notalk,notalk,FAIL,2200000,41158757.0,41158757.0,1971FAIL,12659931,236848653.0,236848653.0
1792,1971,http://www.imdb.com/title/tt0067992,Willy Wonka & the Chocolate Factory,men-disagree,men,FAIL,3000000,4000000.0,4000000.0,1971FAIL,17263543,23018057.0,23018057.0
1793,1970,http://www.imdb.com/title/tt0065466,Beyond the Valley of the Dolls,ok,ok,PASS,1000000,9000000.0,9000000.0,1970PASS,5997631,53978683.0,53978683.0


You can look at specific rows of the data frame using the same syntax that we used to slice a list.

In [57]:
df[100:105]

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013,domgross_2013,intgross_2013
100,2012,http://www.imdb.com/title/tt1611224,Abraham Lincoln: Vampire Hunter,dubious-disagree,dubious,FAIL,67500000,37519139.0,115119139.0,2012FAIL,68488103,38068365.0,116804318.0
101,2012,http://www.imdb.com/title/tt1591479,Act of Valor,nowomen,nowomen,FAIL,12000000,70012847.0,82497035.0,2012FAIL,12175663,71037735.0,83704673.0
102,2012,http://www.imdb.com/title/tt1602620,Amour,dubious-disagree,dubious,FAIL,9700000,6738954.0,25915719.0,2012FAIL,9841994,6837603.0,26295088.0
103,2012,http://www.imdb.com/title/tt1781769,Anna Karenina,men-disagree,men,FAIL,49000000,12816367.0,70735540.0,2012FAIL,49717290,13003980.0,71771007.0
104,2012,http://www.imdb.com/title/tt1024648,Argo,ok-disagree,ok,PASS,44500000,136025503.0,221300694.0,2012PASS,45151416,138016721.0,224540219.0


Finally, look at the metadata of the data frame.

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1794 entries, 0 to 1793
Data columns (total 13 columns):
year             1794 non-null int64
imdb             1794 non-null object
title            1794 non-null object
test             1794 non-null object
clean_test       1794 non-null object
binary           1794 non-null object
budget           1794 non-null int64
domgross         1777 non-null float64
intgross         1783 non-null float64
code             1794 non-null object
budget_2013      1794 non-null int64
domgross_2013    1776 non-null float64
intgross_2013    1783 non-null float64
dtypes: float64(4), int64(3), object(6)
memory usage: 182.3+ KB


In [59]:
df.columns

Index(['year', 'imdb', 'title', 'test', 'clean_test', 'binary', 'budget',
       'domgross', 'intgross', 'code', 'budget_2013', 'domgross_2013',
       'intgross_2013'],
      dtype='object')

Now let's see what types of data we have in our dataset. What type of data is in the 'year' column in the dataset?

In [60]:
type(df['year'])

pandas.core.series.Series
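
`type()` tells us what kind of container the column is (a Series), not what kind of values it holds. To see the type of the values, one option is the `.dtype` attribute, or `.dtypes` for every column at once. The sketch below uses a small made-up dataframe named `sample` rather than the workshop data.

```python
import pandas as pd

# Hypothetical miniature of the workshop data
sample = pd.DataFrame({
    'year': [2013, 2012, 2013],
    'binary': ['FAIL', 'PASS', 'FAIL'],
})

print(type(sample['year']))    # the container: a pandas Series
print(sample['year'].dtype)    # the type of the values inside it
print(sample.dtypes)           # one dtype per column
```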

There are some nice built-in functions for giving summary statistics of dataframes, their columns and rows.

Let's look at the year column:

In [61]:
df['year'].describe()

count    1794.000000
mean     2002.552397
std         8.979731
min      1970.000000
25%      1998.000000
50%      2005.000000
75%      2009.000000
max      2013.000000
Name: year, dtype: float64

You can also apply these statistical functions directly to the data in a column:

In [62]:
df['year'].max()

2013

Let's look at the 'binary' column.

In [63]:
type(df['binary'])

pandas.core.series.Series

Since this is a series (which operates a little bit like a list) we can use an index value to look at the first value (remember the index starts at zero). 

The most basic way to select subsets of the data is the `[ ]` operator. For a DataFrame, `df['column']` returns the Series that corresponds to that column label.

In [64]:
type(df['binary'][0])

str

In [65]:
df['binary'][0]

'FAIL'

In [66]:
df['binary'][0:10]

0    FAIL
1    PASS
2    FAIL
3    FAIL
4    FAIL
5    FAIL
6    FAIL
7    PASS
8    PASS
9    FAIL
Name: binary, dtype: object

You can also look at all of the unique values in a particular column using the .unique() method.

In [67]:
df['binary'].unique()

array(['FAIL', 'PASS'], dtype=object)
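
A related pair of Series methods, `.value_counts()` and `.nunique()`, report how often each value occurs and how many distinct values there are. This sketch uses the first five 'binary' values shown above in a stand-alone Series named `sample`.

```python
import pandas as pd

# First five 'binary' values from the output above
sample = pd.Series(['FAIL', 'PASS', 'FAIL', 'FAIL', 'FAIL'], name='binary')

counts = sample.value_counts()   # how many times each value occurs
print(counts)
print(sample.nunique())          # how many distinct values there are
```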

### Pandas Index
An index is an immutable ndarray that implements an ordered, sliceable set. 
Indices are required for pandas objects, but are automatically created as a RangeIndex if not otherwise given or set.

In [68]:
df.index

RangeIndex(start=0, stop=1794, step=1)

We can set a new index using an existing column in the df. 
`.set_index()` can take a couple of arguments: 
1. the name of the column to use
2. `inplace=True`, which modifies the existing dataframe directly instead of returning a new one.

In [69]:
df.set_index('title', inplace=True)

In [70]:
df.index

Index(['21 & Over', 'Dredd 3D', '12 Years a Slave', '2 Guns', '42', '47 Ronin',
       'A Good Day to Die Hard', 'About Time', 'Admission', 'After Earth',
       ...
       'The Sting', '1776', 'Pink Flamingos', 'The Godfather',
       'Escape from the Planet of the Apes', 'Shaft', 'Straw Dogs',
       'The French Connection', 'Willy Wonka & the Chocolate Factory',
       'Beyond the Valley of the Dolls'],
      dtype='object', name='title', length=1794)
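
As a sketch of what `inplace` changes: without it, `.set_index()` leaves the original dataframe alone and returns a new one, which you would assign to a variable. The two-row dataframe and the names `sample`, `by_title`, and `kept` are made up for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['21 & Over', 'Dredd 3D'],
    'year': [2013, 2012],
})

# Without inplace=True, set_index returns a new DataFrame and leaves the original alone
by_title = sample.set_index('title')
print(sample.index)
print(by_title.index)

# drop=False keeps 'title' as a regular column as well as using it for the index
kept = sample.set_index('title', drop=False)
print(list(kept.columns))
```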

We can also sort a dataframe by its index.

In [71]:
df.sort_index(inplace=True)
df.tail(5)

title,year,imdb,test,clean_test,binary,budget,domgross,intgross,code,budget_2013,domgross_2013,intgross_2013
Zoolander,2001,http://www.imdb.com/title/tt0196229,ok-disagree,ok,PASS,28000000,45172250.0,60780981.0,2001PASS,36843687,59439723.0,79978408.0
Zoom,2006,http://www.imdb.com/title/tt0383060,ok,ok,PASS,35000000,11989328.0,12506188.0,2006PASS,40452872,13857222.0,14454606.0
Zwartboek,2006,http://www.imdb.com/title/tt0389557,ok,ok,PASS,22000000,4398532.0,27238354.0,2006PASS,25427520,5083807.0,31481990.0
[Rec],2007,http://www.imdb.com/title/tt1038988,ok,ok,PASS,2100000,,27117954.0,2007PASS,2359441,,30468201.0
xXx,2002,http://www.imdb.com/title/tt0295701,notalk,notalk,FAIL,70000000,141930000.0,267200000.0,2002FAIL,90662545,183824786.0,346071886.0


Depending on what you want to accomplish, titles might not make for a very convenient index, so let's reset it.

In [72]:
df.reset_index(inplace=True)
df.index

RangeIndex(start=0, stop=1794, step=1)

You can also create more than one index, using a Multi-Index. Let's use both the year and binary columns as indexes.

In [73]:
df.set_index(['year','binary'], inplace = True)
df.head(5)

year,binary,title,imdb,test,clean_test,budget,domgross,intgross,code,budget_2013,domgross_2013,intgross_2013
2009,FAIL,(500) Days of Summer,http://www.imdb.com/title/tt1022603,notalk-disagree,notalk,7500000,32391374.0,60803210.0,2009FAIL,8142987,35168338.0,66015966.0
1999,PASS,10 Things I Hate About You,http://www.imdb.com/title/tt0147800,ok,ok,13000000,38177966.0,60414025.0,1999PASS,18180006,53390436.0,84486720.0
2013,FAIL,12 Years a Slave,http://www.imdb.com/title/tt2024544,notalk-disagree,notalk,20000000,53107035.0,158607035.0,2013FAIL,20000000,53107035.0,158607035.0
2010,FAIL,127 Hours,http://www.imdb.com/title/tt1542344,dubious-disagree,dubious,18000000,18335230.0,60735230.0,2010FAIL,19228173,19586277.0,64879307.0
2004,PASS,13 Going on 30,http://www.imdb.com/title/tt0337563,ok,ok,30000000,57139723.0,96439723.0,2004PASS,36995786,70464299.0,118928779.0


### The .loc indexer

The .loc indexer allows multidimensional selection based on the labels of the index. This can be more powerful than using a RangeIndex since we can look at specific combinations of factors.

Here we use .loc[] to show the title column for all rows that have 2013 and PASS in the year and binary indexes.

In [74]:
df.loc[2013,'PASS']['title']

year  binary
2013  PASS                                 About Time
      PASS                                  Admission
      PASS                            American Hustle
      PASS                       August: Osage County
      PASS                        Beautiful Creatures
      PASS                               Blue Jasmine
      PASS                                     Carrie
      PASS                            Despicable Me 2
      PASS                                    Elysium
      PASS                                       Epic
      PASS                   Escape from Planet Earth
      PASS                                  Evil Dead
      PASS                         Fast and Furious 6
      PASS                                     Frozen
      PASS                      G.I. Joe: Retaliation
      PASS                                     Gloria
      PASS             Hansel & Gretel: Witch Hunters
      PASS                                 Kick-Ass 2
      PASS     
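
The same kind of lookup can also be written as a single `.loc[]` call that takes a tuple of row labels followed by the column label. The sketch below assumes a small made-up dataframe (`sample`) with a few titles from the workshop data. Sorting the index first with `.sort_index()` is a common way to keep MultiIndex lookups efficient.

```python
import pandas as pd

sample = pd.DataFrame({
    'year': [2013, 2013, 2012, 2013],
    'binary': ['PASS', 'FAIL', 'PASS', 'PASS'],
    'title': ['About Time', '2 Guns', 'Dredd 3D', 'Admission'],
})

# Sorting the MultiIndex lets pandas look labels up efficiently
indexed = sample.set_index(['year', 'binary']).sort_index()

# One .loc call: a tuple of row labels, then the column label
titles = indexed.loc[(2013, 'PASS'), 'title'].to_list()
print(sorted(titles))
```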

If you wanted to work with those titles later, you could save them to a list using the .to_list() method.

In [75]:
movies_2013_list = df.loc[2013,'PASS']['title'].to_list()
print(movies_2013_list[0:10])

['About Time', 'Admission', 'American Hustle', 'August: Osage County', 'Beautiful Creatures', 'Blue Jasmine', 'Carrie', 'Despicable Me 2', 'Elysium', 'Epic']

### Editing columns

We can rename, drop, and add new columns. Use .rename() to set a new name for the imdb column to "URL" since that's a more accurate description of that content.

In [76]:
df.rename(columns = {'imdb': 'URL'}, inplace = True)
df.head(5)

year,binary,title,URL,test,clean_test,budget,domgross,intgross,code,budget_2013,domgross_2013,intgross_2013
2009,FAIL,(500) Days of Summer,http://www.imdb.com/title/tt1022603,notalk-disagree,notalk,7500000,32391374.0,60803210.0,2009FAIL,8142987,35168338.0,66015966.0
1999,PASS,10 Things I Hate About You,http://www.imdb.com/title/tt0147800,ok,ok,13000000,38177966.0,60414025.0,1999PASS,18180006,53390436.0,84486720.0
2013,FAIL,12 Years a Slave,http://www.imdb.com/title/tt2024544,notalk-disagree,notalk,20000000,53107035.0,158607035.0,2013FAIL,20000000,53107035.0,158607035.0
2010,FAIL,127 Hours,http://www.imdb.com/title/tt1542344,dubious-disagree,dubious,18000000,18335230.0,60735230.0,2010FAIL,19228173,19586277.0,64879307.0
2004,PASS,13 Going on 30,http://www.imdb.com/title/tt0337563,ok,ok,30000000,57139723.0,96439723.0,2004PASS,36995786,70464299.0,118928779.0


We can add a new blank column by declaring it in a similar manner as we would declare a new variable.

In [77]:
df['test_col'] = '' # you could also add a string or other content here that would be assigned to every row of the df column 'test_col'

In [78]:
df.head(3)

year,binary,title,URL,test,clean_test,budget,domgross,intgross,code,budget_2013,domgross_2013,intgross_2013,test_col
2009,FAIL,(500) Days of Summer,http://www.imdb.com/title/tt1022603,notalk-disagree,notalk,7500000,32391374.0,60803210.0,2009FAIL,8142987,35168338.0,66015966.0,
1999,PASS,10 Things I Hate About You,http://www.imdb.com/title/tt0147800,ok,ok,13000000,38177966.0,60414025.0,1999PASS,18180006,53390436.0,84486720.0,
2013,FAIL,12 Years a Slave,http://www.imdb.com/title/tt2024544,notalk-disagree,notalk,20000000,53107035.0,158607035.0,2013FAIL,20000000,53107035.0,158607035.0,


We could also take a more sophisticated approach and create a new column from the data in another column. Here we use the str.split() method to split each URL on every '/' it contains, and send the piece that follows the fourth forward slash to a new column that we'll call 'imdb_id'.

1. First we declare a new column. 
2. It will equal the df['URL'] column, which we split using str.split(). 
3. str.split() first takes the string that we want to split on, here '/'. 
4. The keyword argument expand=True tells the method to expand the split pieces into separate columns, numbered from 0.
5. The [4] selects column 4 of that result, which holds the text after the fourth '/' in the URL.

In [79]:
df['imdb_id'] = df['URL'].str.split('/',expand = True)[4]
df.head(5)

year,binary,title,URL,test,clean_test,budget,domgross,intgross,code,budget_2013,domgross_2013,intgross_2013,test_col,imdb_id
2009,FAIL,(500) Days of Summer,http://www.imdb.com/title/tt1022603,notalk-disagree,notalk,7500000,32391374.0,60803210.0,2009FAIL,8142987,35168338.0,66015966.0,,tt1022603
1999,PASS,10 Things I Hate About You,http://www.imdb.com/title/tt0147800,ok,ok,13000000,38177966.0,60414025.0,1999PASS,18180006,53390436.0,84486720.0,,tt0147800
2013,FAIL,12 Years a Slave,http://www.imdb.com/title/tt2024544,notalk-disagree,notalk,20000000,53107035.0,158607035.0,2013FAIL,20000000,53107035.0,158607035.0,,tt2024544
2010,FAIL,127 Hours,http://www.imdb.com/title/tt1542344,dubious-disagree,dubious,18000000,18335230.0,60735230.0,2010FAIL,19228173,19586277.0,64879307.0,,tt1542344
2004,PASS,13 Going on 30,http://www.imdb.com/title/tt0337563,ok,ok,30000000,57139723.0,96439723.0,2004PASS,36995786,70464299.0,118928779.0,,tt0337563
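
To see why position 4 holds the ID, it can help to split a single URL with plain Python's `str.split()`, which works on one string the way the pandas version works on a whole column. The variable names `url` and `pieces` are illustrative.

```python
url = "http://www.imdb.com/title/tt1711425"

pieces = url.split('/')   # split on every '/'
print(pieces)
print(pieces[4])          # position 4 holds the text after the fourth '/'
```

The empty string at position 1 comes from the two slashes in a row in `http://`.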


### Logic Looping

Let's look at two powerful tools for building efficient code and for cleaning and transforming your data:
* if/else statements
* for loops

#### if/else
We can use if and else statements to provide conditional responses to particular cases. 

Note that there's a difference between the = operator which assigns a value to a variable and a conditional == which checks to see if something is true or not. 

You can also use greater-than and less-than comparison (relational) operators. Here's a [cheat sheet](https://www.dummies.com/programming/python/beginning-programming-python-dummies-cheat-sheet/) of all kinds of Python operators. 

In [80]:
a = 7
if a == 7:
    print('a is equal to 7')
else:
    print('a is not equal to 7')

a is equal to 7


We can build a stack of conditionals using 'elif'.

In [81]:
a = 5
if a == 7:
    print('a is equal to 7')
elif a == 5:
    print('a is equal to 5')
else:
    print('a is not equal to 7 or 5')

a is equal to 5
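
The same if/elif/else pattern works with the other comparison operators (`<`, `<=`, `>`, `>=`, `!=`). Here is a small sketch with a made-up `budget` value; the cut-off amounts and labels are illustrative only.

```python
budget = 13000000   # hypothetical value, for illustration

if budget < 1000000:
    label = 'under $1 million'
elif budget <= 50000000:     # <= means "less than or equal to"
    label = 'between $1 million and $50 million'
else:
    label = 'over $50 million'

print(label)
```

Because the conditions are checked in order, the `elif` branch is only reached when the budget is already known to be at least 1 million.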


#### For loop syntax
A for loop always iterates through a collection of items, like a list.
```
for x in y:
    do_something # the code in the loop needs to be indented
    do_another_thing
```

In [82]:
odds = [1, 3, 7, 9]
for num in odds:
    print(num)

1
3
7
9


We can combine for loops and if/else statements to check values (and do things) while iterating through a collection. 

There are more efficient ways to do the following, but this is an easy-to-read version. We'll also stack a variety of conditionals into a single if statement using the or operator.

In [83]:
nums = [1,2,3,4,5,6,7,8,9]

for num in nums:
    if num == 1 or num == 3 or num == 5 or num == 7 or num == 9:
        print(num, "is odd")
    else:
        print(num, "is even")

1 is odd
2 is even
3 is odd
4 is even
5 is odd
6 is even
7 is odd
8 is even
9 is odd
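
One of the more compact ways to do the same check is the modulo operator `%`, which gives the remainder after division. This sketch replaces the chain of `or` comparisons with a single test; the `labels` list is added only so the results can be inspected afterwards.

```python
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]

labels = []
for num in nums:
    # num % 2 is the remainder after dividing by 2: 1 for odd numbers, 0 for even
    if num % 2 == 1:
        labels.append('odd')
    else:
        labels.append('even')
    print(num, "is", labels[-1])
```

Unlike the version above, this works for any list of whole numbers, not just 1 through 9.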


### Looping in pandas dataframes
Dataframes require specific functions to use a for loop effectively. .iterrows(), for example, iterates over each row in the dataframe, returning something called a "Series" for each row. 

In [84]:
# we could save these to a list instead of printing them up
print('Wow, these were some really expensive movies:')
for index, row in df.iterrows():
    if row['budget'] > 200000000:
        print(row['title'])

Wow, these were some really expensive movies:
47 Ronin
Avatar
Battleship
Harry Potter and the Half-Blood Prince
John Carter
King Kong
Man of Steel
Men in Black III
Pirates of the Caribbean: At World's End
Pirates of the Caribbean: Dead Man's Chest
Pirates of the Caribbean: On Stranger Tides
Quantum of Solace
Robin Hood
Superman Returns
Tangled
The Amazing Spider-Man
The Chronicles of Narnia: Prince Caspian
The Dark Knight Rises
The Golden Compass
The Hobbit: An Unexpected Journey
The Hobbit: The Desolation of Smaug
The Lone Ranger
Transformers: Revenge of the Fallen
X-Men: The Last Stand


We can combine the budget check with the pass/fail index level ('binary') that we created above to explore which high- and low-budget movies passed the Bechdel test.

In [85]:
print("Wow, these were some really expensive movies, but at least they passed the Bechdel test:", '\n')
for index, row in df.iterrows():
    if row['budget'] > 200000000 and 'PASS' in index: # $200 million + movies!
        print(row['title'], index)

Wow, these were some really expensive movies, but at least they passed the Bechdel test: 

Harry Potter and the Half-Blood Prince (2009, 'PASS')
John Carter (2012, 'PASS')
Man of Steel (2013, 'PASS')
Pirates of the Caribbean: At World's End (2007, 'PASS')
Tangled (2010, 'PASS')
The Chronicles of Narnia: Prince Caspian (2008, 'PASS')
The Golden Compass (2007, 'PASS')
Transformers: Revenge of the Fallen (2009, 'PASS')
X-Men: The Last Stand (2006, 'PASS')


In [86]:
# low budget movies
for index, row in df.iterrows():
    if row['budget'] < 1000000 and 'PASS' in index: # Less than a million $
        print(row['title'], index)

Another Earth (2011, 'PASS')
Chasing Amy (1997, 'PASS')
Circumstance (2011, 'PASS')
Funny Ha Ha (2002, 'PASS')
Hard Candy (2005, 'PASS')
Like Crazy (2011, 'PASS')
London To Brighton (2006, 'PASS')
Love and Other Catastrophes (1996, 'PASS')
Lovely and Amazing (2001, 'PASS')
Pieces of April (1993, 'PASS')
Pink Flamingos (1972, 'PASS')
Quinceanera (2006, 'PASS')
Safety Not Guaranteed (2012, 'PASS')
Sisters in Law (2005, 'PASS')
Slacker (1991, 'PASS')
The Blair Witch Project (1999, 'PASS')
The Incredibly True Adventures of Two Girls in Love (1995, 'PASS')
Tiny Furniture (2010, 'PASS')
Undead (2003, 'PASS')
Welcome to the Dollhouse (1995, 'PASS')
Wendy and Lucy (2008, 'PASS')
When the Cat's Away (1996, 'PASS')
Yesterday Was a Lie (2009, 'PASS')
Your Sister's Sister (2011, 'PASS')


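We could put this all together and create a new column to track different levels of movie budgets, and then compare how many films in each category pass the Bechdel test. One caveat: the (year, binary) index is not unique, so each `df.loc[index, 'budget_class']` assignment below writes to every row that shares that year and result, and the counts can differ from a strictly row-by-row classification.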

In [87]:
for index, row in df.iterrows():
    if row['budget'] < 1000000: # less than $1 million
        df.loc[index,'budget_class'] = "cheap"
    elif row['budget'] > 1000000 and row['budget'] < 50000000:
        df.loc[index,'budget_class'] = "medium"
    elif row['budget'] > 50000000:
        df.loc[index,'budget_class'] = "expensive"


In [88]:
df.reset_index(inplace=True)

In [89]:
df.set_index(['budget_class', 'binary'], inplace = True)

In [90]:
print('Cheap movies:')
print(df.loc['cheap','PASS']['title'].count(), 'pass : ', df.loc['cheap','FAIL']['title'].count(), 'fail \n')

print('Mid-level movies:')
print(df.loc['medium','PASS']['title'].count(), 'pass : ', df.loc['medium','FAIL']['title'].count(), 'fail \n')

print('Expensive movies:')
print(df.loc['expensive','PASS']['title'].count(), 'pass : ', df.loc['expensive','FAIL']['title'].count(), 'fail \n')


Cheap movies:
93 pass :  6 fail 

Mid-level movies:
586 pass :  511 fail 

Expensive movies:
123 pass :  474 fail 
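
A pass/fail table like this can also be built in one step with `pd.crosstab`, which counts every combination of two columns. This is a sketch on made-up rows, and it assumes 'budget_class' and 'binary' are ordinary columns (for example after `df.reset_index()`), not index levels.

```python
import pandas as pd

# Made-up rows, for illustration only
sample = pd.DataFrame({
    'budget_class': ['cheap', 'medium', 'medium', 'expensive', 'expensive'],
    'binary': ['PASS', 'PASS', 'FAIL', 'FAIL', 'FAIL'],
})

# Rows are budget classes, columns are PASS/FAIL, cells are counts
table = pd.crosstab(sample['budget_class'], sample['binary'])
print(table)
```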




### Functions
We could make the above much more efficient, and also more extensible by using a function.

- All functions have a return value.
- If there is no return inside the code, the return value is None.
- Functions are objects, so they can be assigned to variables and passed to other functions.

You can define a new function using 'def'.

This function, which we'll call 'power', accepts two parameters: x and y. Since y has a default value of 2, calling the function without a second argument raises x to the power of 2.

The first line of the function body is a docstring, a string that describes what the function does (using three quotes allows it to wrap across multiple lines).

The second line assigns a new variable, retval, using the x and y values entered when one calls the function.

The third line returns the retval value.

In [91]:
def power(x, y=2):
    '''Raises the value of x to the y-th power'''
    retval = x ** y
    return retval

We can now call the function in the same way we would call any pre-defined Python functions.

In [92]:
power(2)

4

In [93]:
power(2, 3)

8

In [95]:
power(4,10)

1048576
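
Two of the points listed above can be seen in a short sketch: a function with no `return` statement gives back `None`, and a function can be stored in a variable like any other object. The function `greet` and the name `shout` are made up for illustration.

```python
def greet(name):
    '''Prints a greeting but has no return statement'''
    print('Hello,', name)

result = greet('workshop')   # the print happens during the call
print(result)                # the call itself evaluated to None

# Functions are objects too, so they can be stored in a variable and passed around
shout = str.upper
print(shout('pass'))
```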

Another way to calculate budget categories would be to write a function and then use df.apply() to call the function. Here's our own budget_check() function.

In [96]:
def budget_check(row):
    if row['budget'] < 1000000:
        return 'cheap'
    elif row['budget'] > 1000000 and row['budget'] < 50000000:
        return 'medium'
    else:
        # everything else lands here, including a budget of exactly 1000000
        return 'expensive'

There's no output to the function itself, because we still need to run it by passing it as a parameter to df.apply().

In [97]:
df['budget_check'] = df.apply(budget_check, axis = 1)

Now we can just count the occurrence of each possible column value.

In [98]:
df['budget_check'].value_counts()

medium       1134
expensive     609
cheap          51
Name: budget_check, dtype: int64
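
For binning numbers into categories, pandas also offers `pd.cut`. The sketch below is one possible alternative, not the workshop's method. It uses made-up budgets and treats the cut-offs as inclusive on the left, so a budget of exactly 1 million counts as 'medium' (in `budget_check()` above, that value falls through to 'expensive').

```python
import pandas as pd

# Made-up budgets chosen to sit on and around the cut-offs
budgets = pd.Series([500000, 1000000, 13000000, 50000000, 250000000])

# right=False makes each bin include its left edge and exclude its right edge:
# [0, 1 million), [1 million, 50 million), [50 million, infinity)
budget_class = pd.cut(
    budgets,
    bins=[0, 1000000, 50000000, float('inf')],
    labels=['cheap', 'medium', 'expensive'],
    right=False,
)
print(budget_class.to_list())
```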

You can continue to explore the Bechdel data by assigning the 'budget_check' column as an index and writing new functions or other code to output your results!