# Introduction to Python for Social Science
- UMN LATIS workshop, Oct 16, 2020
- Michael Beckstrand (mjbeckst@umn.edu) and Cody Hennesy (chennesy@umn.edu)

* Use Python 3 in a JupyterLab computing environment
* Use Python to grab data from a large number of files quickly
* Load a comma-delimited spreadsheet (.csv) into Pandas as a dataframe
* View and clean that data
* Save cleaned data file in formats for later use

### JupyterLab environment
- How to navigate the directory of folders
- How to create a cell
- How to run a cell

### Variables
- To "run" a Jupyter cell, hold down Shift and press Return/Enter, or click the "play" icon (right-facing triangle) in the Jupyter toolbar above.

In Python, variable names:

* can include letters, digits, and underscores
* cannot start with a digit
* are case sensitive.

You can do calculations and save text strings in variables too.

In [None]:
weight_1 = 140
print(weight_1)

In [None]:
weight_2 = 60
print(weight_1 + weight_2)

In [None]:
sentence = "This is a string of text."
print(sentence)
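The naming rules above can be tried out in a short sketch. The variable names and values here (`height_cm`, `Height_cm`, `weight_kg`, `bmi`) are made up for illustration and are not part of the workshop data.

```python
# names may contain letters, digits, and underscores
height_cm = 170
# names are case sensitive, so this is a different variable
Height_cm = 180
print(height_cm, Height_cm)

# a variable can hold the result of a calculation
weight_kg = 65
bmi = weight_kg / (height_cm / 100) ** 2
print(round(bmi, 1))

# strings can be joined with +
greeting = "Hello, " + "workshop"
print(greeting)
```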

### Types
Everything in Python is an object of some type. Objects bundle data together with related functions, called methods.

In [None]:
print(type(weight_1))
print(type(sentence))
print(type(140.0))
print(type(print))
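As a rough illustration of what "methods" means, the sketch below calls a few methods that belong to strings and floats. The variable `shouted` is a made-up name for this example.

```python
sentence = "This is a string of text."

# a method is a function attached to an object; call it with a dot
shouted = sentence.upper()
print(shouted)

# different types offer different methods
print((140.0).is_integer())   # a float method
print(sentence.count("i"))    # a string method

# isinstance() checks whether a value is of a given type
print(isinstance(sentence, str))
print(isinstance(140, float))
```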

### Lists

The most popular kind of data collection in Python is the list, which takes the place of arrays in programming languages like C and Fortran.
Lists have two important characteristics:
1. They are mutable, i.e., they can be changed after they are created.
2. They are heterogeneous, i.e., they can store values of many different types.

To create a new list, you can just put some values in square brackets with commas in between.

In [None]:
my_list = ['winter', 'sprung', 'summer', 'fall'] # this list has a typo
print(my_list)

In [None]:
type(my_list)

To fetch the element at a specific location, put the *index* of that location in square brackets. Keep in mind that Python starts counting list indexes from 0, so the four-item list above has the index values `my_list[0]`, `my_list[1]`, `my_list[2]`, and `my_list[3]`.

In [None]:
print(my_list[1])

In [None]:
my_list[1] = 'spring'
print(my_list[1])

Lists can contain different types of Python objects, and you can even create lists of lists.

In [None]:
mixed_list = ['word', 3, 10.2, ['list', 'of', 'items']]
print(mixed_list[3])
print(type(mixed_list[3]))

Using syntax similar to a list index, you can look at slices (parts) of a list:

In [None]:
mixed_list[1:3]

In [None]:
mixed_list[2:]

In [None]:
mixed_list[:2]
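The slice syntax can be summarized in a small sketch. It uses a separate, made-up list called `seasons` so that it does not depend on the cells above.

```python
seasons = ['winter', 'spring', 'summer', 'fall']

# a slice includes the start index and stops just before the end index
middle = seasons[1:3]
print(middle)

# leaving out the start or the end means "from the beginning" or "to the end"
first_two = seasons[:2]
last_two = seasons[2:]
print(first_two, last_two)

# negative indexes count from the end of the list
last_item = seasons[-1]
print(last_item)

# slicing returns a new list; the original keeps all of its items
print(len(seasons))
```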

Sometimes you'll want to create an empty list (which you can fill later). You can do so by declaring a list variable with empty square brackets.

In [None]:
future_list = []

Now let's create a list of four numbers and assign it to the variable 'a'.

In [None]:
a = [1, 2, 3, 4]

Create another variable 'b' that references the variable you created in the previous step.

In [None]:
b = a

Now change the first item in the list that 'b' references.

In [None]:
b[0] = 7

Look at 'a' and 'b'. Did the change you made to 'b' in the previous step also change 'a'?

In [None]:
a

In [None]:
b

In [None]:
# .copy() makes a new, independent list instead of a second name for the same list
b = a.copy()

In [None]:
b[0] = 1

In [None]:
a

In [None]:
b
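The cells above can be condensed into one sketch that contrasts plain assignment with `.copy()`. The third variable `c` is introduced here only to keep the two cases side by side.

```python
a = [1, 2, 3, 4]

# plain assignment gives the same list a second name
b = a
b[0] = 7
print(a)        # the change made through b is visible through a
print(a is b)   # both names refer to one list object

# .copy() builds a new, independent list
c = a.copy()
c[0] = 1
print(a, c)
print(a is c)
```

The `is` operator checks whether two names refer to the very same object, which is a quick way to tell an alias from a copy.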

## Importing Libraries (aka packages)
Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs - so we only import what we need for each program.


Python ships with a large collection of built-in modules, called the standard library. 
Some functions (such as `print`) are always available; others must be imported first.

In [None]:
# this fails with a NameError if the random module has not been imported yet
rand = random.randint(1,10)

In [None]:
import random

In [None]:
rand = random.randint(1,10)
print(rand)
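A short sketch of the common import patterns follows. The choice of `math` and `statistics` as example modules is ours; both are part of the standard library.

```python
# built-in functions such as len() and print() need no import
print(len("python"))

# import a whole module and reach its functions with a dot
import random
random.seed(42)            # fixing the seed makes the "random" result repeatable
roll = random.randint(1, 10)
print(roll)

# or import just the name you need
from math import sqrt
print(sqrt(16))

# using a module that has not been imported raises a NameError
try:
    statistics.mean([1, 2, 3])
except NameError as err:
    print("NameError:", err)
```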

### Loading tabular data (such as a CSV) into Pandas
- Pandas is a widely used Python library for statistics, particularly on tabular data.
- It borrows many features from R’s dataframes.
  - A dataframe is a 2-dimensional table whose columns have names and can have different data types.
- Load it with `import pandas as pd`. The alias `pd` is commonly used for Pandas.

In [None]:
import pandas as pd

Read a Comma-Separated Values (CSV) data file with `pd.read_csv`.
- Argument is the name of the file to be read.
- Assign result to a variable to store the data that was read.

In [None]:
df = pd.read_csv("bechdel_data.csv")

In [None]:
# printing the whole dataframe gives a truncated view that is hard to read
print(df)
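If you want to experiment without the workshop file, the following sketch reads a tiny made-up CSV from a string instead of from disk. The rows, the variable names `csv_text` and `toy`, and the three columns are invented for illustration.

```python
import io
import pandas as pd

# a tiny made-up CSV standing in for a file on disk
csv_text = """title,year,binary
Movie A,2012,PASS
Movie B,2013,FAIL
Movie C,2013,PASS
"""

# read_csv accepts a file path or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)             # (number of rows, number of columns)
print(toy.columns.tolist())  # column names taken from the first line
print(toy.dtypes)            # data type that pandas inferred for each column
```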

### Examining your data
After reading in the data, do a quick check to see what it looks like. Start by looking at the first 5 lines of the data frame.

In [None]:
df.head(5)

Now look at the last 5 lines.

In [None]:
df.tail(5)

You can look at specific rows of the data frame using the same syntax that we used to slice a list.

In [None]:
df[100:105]

Finally, look at the metadata of the data frame.

In [None]:
df.info()

In [None]:
df.columns

Now let's look at a single column. What type of object is the 'year' column? (`type()` reports the container, a pandas Series; the type of the values inside it is listed in the `df.info()` output above.)

In [None]:
type(df['year'])

There are some nice built-in functions for giving summary statistics of dataframes, their columns and rows.

Let's look at the year column:

In [None]:
df['year'].describe()

You can also apply these statistical functions directly to the data in a column:

In [None]:
df['year'].max()

Let's look at the 'binary' column.

In [None]:
type(df['binary'])

Since this is a Series (which operates a little bit like a list), we can use an index value to look at the first value (remember the index starts at zero). 

The most basic way to select a subset of the data is the `[ ]` operator. For a DataFrame, `df['binary']` returns the Series for the column with that label.

In [None]:
type(df['binary'][0])

In [None]:
df['binary'][0]

In [None]:
df['binary'][0:10]

You can also look at all of the unique values in a particular column using the .unique() function.

In [None]:
df['binary'].unique()
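The difference between the type of a column, the type of its values, and the type of a single value can be seen in a small sketch. The dataframe `toy` below is made up and only mimics two columns of the workshop data.

```python
import pandas as pd

# made-up values standing in for two columns of the workshop data
toy = pd.DataFrame({
    'year': [2012, 2013, 2013, 2011],
    'binary': ['PASS', 'FAIL', 'PASS', 'FAIL'],
})

col = toy['binary']
print(type(col))           # the column itself is a pandas Series
print(col.dtype)           # the type of the values stored in it
print(type(col[0]))        # a single value is an ordinary Python string

print(col.unique())        # the distinct values
print(col.value_counts())  # how often each value occurs
print(toy['year'].max())
```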

### Pandas Index
An index is an immutable ndarray that implements an ordered, sliceable set. 
Indexes are required for pandas objects, but pandas creates a RangeIndex automatically if none is given or set.

In [None]:
df.index

We can set a new index using an existing column in the df. 
Here `.set_index()` takes two arguments: 
1. the name of the column to use as the index
2. `inplace=True`, which changes `df` itself instead of returning a new dataframe.

In [None]:
df.set_index('title', inplace=True)

In [None]:
df.index

We can also sort a dataframe by its index.

In [None]:
df.sort_index(inplace=True)
df.tail(5)

Depending on what you want to accomplish, titles might not make for a very convenient index, so let's reset it.

In [None]:
df.reset_index(inplace=True)
df.index
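To see what `inplace=True` changes, the sketch below does the same steps without it, so each method returns a new dataframe and the original stays untouched. The dataframe `toy` and the names `by_title` and `restored` are made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Movie B', 'Movie A', 'Movie C'],
    'year': [2013, 2012, 2013],
})

# without inplace=True, set_index returns a new dataframe and leaves toy alone
by_title = toy.set_index('title')
print(toy.index)
print(by_title.index)

# sort_index orders the rows by their index labels
by_title = by_title.sort_index()
print(by_title.index.tolist())

# reset_index moves the index back into an ordinary column
restored = by_title.reset_index()
print(restored.columns.tolist())
```

Assigning the result to a new name is a common alternative to `inplace=True`, since it keeps the earlier version of the data available.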

You can also build an index from more than one column, called a MultiIndex. Let's use both the year and binary columns as index levels.

In [None]:
df.set_index(['year','binary'], inplace = True)
df.head(5)

### The .loc indexer

The .loc indexer allows multidimensional selection based on the labels of the index. This can be more powerful than using a RangeIndex since we can look at specific combinations of factors.

Here we use `.loc[]` to show the title column for all rows that have 2013 and PASS in the year and binary index levels.

In [None]:
df.loc[2013,'PASS']['title']

If you wanted to work with those titles later, you could save them to a list using the .to_list() function.

In [None]:
movies_2013_list = df.loc[2013,'PASS']['title'].to_list()
print(movies_2013_list[0:10])
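A self-contained sketch of label-based selection on a MultiIndex follows. The four rows are made up, and the tuple form `(2013, 'PASS')` is used here because it states explicitly that both values refer to index levels.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'year': [2013, 2013, 2012, 2013],
    'binary': ['PASS', 'FAIL', 'PASS', 'PASS'],
})
toy = toy.set_index(['year', 'binary']).sort_index()

# a tuple selects one combination of index levels; the second argument picks the column
titles_2013_pass = toy.loc[(2013, 'PASS'), 'title']
print(titles_2013_pass.to_list())

# a single label selects on the first level only
print(toy.loc[2013])
```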

### Editing columns

We can rename, drop, and add new columns. Use .rename() to rename the imdb column to "URL" since that's a more accurate description of that content.

In [None]:
df.rename(columns = {'imdb': 'URL'}, inplace = True)
df.head(5)

We can add a new blank column by declaring it in a similar manner as we would declare a new variable.

In [None]:
df['test_col'] = '' # you could also add a string or other content here that would be assigned to every row of the df column 'test_col'

In [None]:
df.head(3)

We could also take a more sophisticated approach, and create a new column by editing the data from another column. Here we use the str.split() function to split the URL column at every '/', and send the piece that follows the fourth forward slash to a new column that we'll call 'imdb_id'.

1. First we declare a new column. 
2. It will equal the df['URL'] column, which we split using str.split(). 
3. str.split() first takes the string that we want to split on, here '/'. 
4. The second parameter, expand=True, tells the function to expand the pieces into separate columns, numbered from 0.
5. The [4] selects column 4 of that result: the fifth piece, which is the text after the fourth '/'. 

In [None]:
df['imdb_id'] = df['URL'].str.split('/',expand = True)[4]
df.head(5)
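To see the intermediate result of the split, here is a sketch on two hypothetical URLs. The addresses are invented and the workshop file may format its URLs differently; the point is only how the pieces are numbered.

```python
import pandas as pd

# hypothetical URLs; the workshop file may format them differently
toy = pd.DataFrame({'URL': [
    'http://www.example.com/title/tt0000001',
    'http://www.example.com/title/tt0000002',
]})

# every '/' is a split point, and expand=True puts each piece in its own column
pieces = toy['URL'].str.split('/', expand=True)
print(pieces)

# columns are numbered from 0, so column 4 holds the text after the fourth '/'
toy['imdb_id'] = pieces[4]
print(toy['imdb_id'].to_list())
```

Note that the two slashes in `http://` produce an empty piece in column 1, which is why the ID ends up in column 4 rather than column 3.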

### Logic and Looping

Let's look at two powerful tools to build efficient code and to clean and transform your data:
* if/else statements
* for loops

#### if/else
We can use if and else statements to provide conditional responses to particular cases. 

Note that there's a difference between the = operator, which assigns a value to a variable, and the comparison operator ==, which checks whether two values are equal. 

You can also use greater than and less than comparison/relational operators. Here's a [cheat sheet](https://www.dummies.com/programming/python/beginning-programming-python-dummies-cheat-sheet/) of all kinds of Python operators. 

In [None]:
a = 7
if a == 7:
    print('a is equal to 7')
else:
    print('a is not equal to 7')

We can build a stack of conditionals using 'elif'.

In [None]:
a = 5
if a == 7:
    print('a is equal to 7')
elif a == 5:
    print('a is equal to 5')
else:
    print('a is not equal to 7 or 5')
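The other comparison operators work the same way as `==`. The sketch below uses a made-up `budget` value to show that each comparison produces `True` or `False`, and that comparisons can be combined with `and`.

```python
budget = 150000000

# each comparison evaluates to True or False
print(budget == 150000000)   # equal to
print(budget != 150000000)   # not equal to
print(budget > 200000000)    # greater than
print(budget <= 150000000)   # less than or equal to

# comparisons can be combined with and / or
if budget > 100000000 and budget < 200000000:
    label = 'between $100 and $200 million'
else:
    label = 'outside that range'
print(label)
```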

#### For loop syntax
A for loop always iterates through a collection of items, like a list.
```
for x in y:
    do_something # the code in the loop needs to be indented
    do_another_thing
```

In [None]:
odds = [1, 3, 7, 9]
for num in odds:
    print(num)

We can combine for loops and if/else statements to check values (and do things) while iterating through a collection. 

There are more efficient ways to do the following, but this is an easy-to-read version. We'll also stack a variety of conditionals into a single if statement using the or operator.

In [None]:
nums = [1,2,3,4,5,6,7,8,9]

for num in nums:
    if num == 1 or num == 3 or num == 5 or num == 7 or num == 9:
        print(num, "is odd")
    else:
        print(num, "is even")
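One of the shorter ways to write the odd/even check is the remainder operator `%`. This is a sketch of that alternative; the list names `odd_nums` and `even_nums` are made up, and the results are collected in lists rather than printed one by one.

```python
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]
odd_nums = []
even_nums = []

for num in nums:
    # num % 2 is the remainder after dividing by 2: 1 for odd, 0 for even
    if num % 2 == 1:
        odd_nums.append(num)
    else:
        even_nums.append(num)

print(odd_nums, "are odd")
print(even_nums, "are even")
```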

### Looping in pandas dataframes
Looping over a dataframe with a for loop requires a specific method. `.iterrows()`, for example, iterates over the rows of the dataframe, returning each row's index label together with a "Series" holding that row's values. 

In [None]:
# we could save these to a list instead of printing them out
print('Wow, these were some really expensive movies:')
for index, row in df.iterrows():
    if row['budget'] > 200000000:
        print(row['title'])

We can use the pass/fail level of the index (the 'binary' column that we moved into the index above) to explore which high- and low-budget movies passed the Bechdel test.

In [None]:
print("Wow, these were some really expensive movies, but at least they passed the Bechdel test:", '\n')
for index, row in df.iterrows():
    if row['budget'] > 200000000 and 'PASS' in index: # $200 million + movies!
        print(row['title'], index)

In [None]:
# low budget movies
for index, row in df.iterrows():
    if row['budget'] < 1000000 and 'PASS' in index: # Less than a million $
        print(row['title'], index)
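As a sketch of two variations on these loops, the example below first saves the matching titles to a list, and then gets a similar result without a loop by filtering with a True/False mask. The dataframe `toy` is made up, and `binary` is kept as an ordinary column here to keep the example short.

```python
import pandas as pd

# made-up budgets, not the workshop data
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'budget': [250000000, 900000, 40000000],
    'binary': ['PASS', 'FAIL', 'PASS'],
})

# loop version: collect matching titles in a list
expensive = []
for index, row in toy.iterrows():
    if row['budget'] > 200000000:
        expensive.append(row['title'])
print(expensive)

# filter version: a comparison on a column gives one True/False per row,
# and & combines two conditions (each wrapped in parentheses)
mask = (toy['budget'] < 50000000) & (toy['binary'] == 'PASS')
print(toy[mask]['title'].to_list())
```

On large dataframes the filter version is usually much faster than `.iterrows()`, though the loop can be easier to read while learning.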

We could put this all together and create a new column to track different levels of movie budgets, and then compare how many films in each category pass the Bechdel test.

In [None]:
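# collect one label per row, then add the labels as a new column in one step
# (df.loc[index, ...] would overwrite every row that shares the same year/binary index)
budget_classes = []
for index, row in df.iterrows():
    if row['budget'] < 1000000: # under $1 million
        budget_classes.append("cheap")
    elif row['budget'] < 50000000: # $1 million up to $50 million
        budget_classes.append("medium")
    elif row['budget'] >= 50000000: # $50 million and up
        budget_classes.append("expensive")
    else:
        budget_classes.append(None) # budget is missing
df['budget_class'] = budget_classes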

In [None]:
df.reset_index(inplace=True)

In [None]:
df.set_index(['budget_class', 'binary'], inplace = True)

In [None]:
print('Cheap movies:')
print(df.loc['cheap','PASS']['title'].count(), 'pass : ', df.loc['cheap','FAIL']['title'].count(), 'fail \n')

print('Mid-level movies:')
print(df.loc['medium','PASS']['title'].count(), 'pass : ', df.loc['medium','FAIL']['title'].count(), 'fail \n')

print('Expensive movies:')
print(df.loc['expensive','PASS']['title'].count(), 'pass : ', df.loc['expensive','FAIL']['title'].count(), 'fail \n')
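For comparison, here is a sketch of how the same classification and counting could be done without loops, using `pd.cut` and `pd.crosstab`. The five rows are made up, and the bin edges simply reuse the $1 million and $50 million thresholds from above.

```python
import pandas as pd

toy = pd.DataFrame({
    'budget': [500000, 20000000, 80000000, 900000, 60000000],
    'binary': ['PASS', 'FAIL', 'FAIL', 'FAIL', 'PASS'],
})

# pd.cut sorts each budget into a labelled bin; right=False makes each bin
# include its lower edge and exclude its upper edge
toy['budget_class'] = pd.cut(
    toy['budget'],
    bins=[0, 1000000, 50000000, float('inf')],
    labels=['cheap', 'medium', 'expensive'],
    right=False,
)

# crosstab counts the rows for every combination of the two columns
counts = pd.crosstab(toy['budget_class'], toy['binary'])
print(counts)
```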
