# Introduction to Python

Welcome to your guided first steps in Python. In today's lesson, we'll walk you through some of the core concepts of Python, including variables, functions, loops, and logic. We will introduce these concepts in the context of tasks you might often find yourself doing when using Python to work with social data.

# Jupyter Notebooks

Let's start by familiarizing ourselves with the app we'll be using to run Python code: the Jupyter Notebook.

Jupyter provides a web-based interactive computing platform where we can combine code, narrative text, equations and visualizations. It will allow us to practice Python programming.

Throughout a notebook, you will see boxes called *cells*, like the ones shown below.

You can write code inside a cell by clicking inside the box to select it.

When we write code in it, we can run it by clicking the run button on the top that looks like a play button.

In [None]:
print(3+5)

Another way to run the code in a cell is with the keyboard shortcut *Shift + Enter*.

In [None]:
print(2+2)

When you write code, it can be a good idea to leave a *comment* that explains in plain English what you are doing. In Python, you do this by starting a line with the # sign. You can write any text you want after the # sign, and Python will ignore it.

In [None]:
# Lines starting with the # sign are comments, which are ignored by Python and can contain any text you want.
# The code below adds two numbers and prints the result
print(1+1)

To create a new cell, press the "plus" button on the top. This will add a new cell below the currently selected one.

# Variables and Statements

Learning objectives
* Understand what a variable is
* Know the difference between variables and values
* Create a variable
* Do things with a variable

We will first learn to create **variables**. 

A Python variable is a memory space used to store values.

You can think of this memory space as a container. The container is assigned a **name**, and inside of it, it contains a specific value, such as a letter or a number.

A **value** is the fundamental thing (e.g., a letter, a number, a sign or a combination of these) that the program manipulates. For our purposes, we will also refer to them as **objects**.

We can easily draw a comparison to mathematical variables in social science. For example, in a study you might have collected participants' ages. Suppose Participant A is age 29. As a social scientist, you might record this data under the label "Participant_A_Age". In this example, the label "Participant_A_Age" is the *name*, and the actual numerical age, 29, is the *value*.

In Python, we can represent this using the "=" syntax, which assigns a value to a named variable. The name goes on the left side and the value goes on the right side. Variable names can be any combination of letters, numbers and underscores, as long as they do not start with a number.

In [None]:
Participant_A_Age = 29

From now on, in any code we write, our notebook will remember the value assigned to the variable "Participant_A_Age". We can therefore use the name "Participant_A_Age" to access the stored value. For example, we can print it out:

In [None]:
print(Participant_A_Age)

Naming variables and using the function *print* is enough for us to write a first simple sentence with code.

For example, if I create variables for a participant's identifier and their age, then I can recall these values into a sentence that uses them.

I can include the rest of the text in my sentence by enclosing it in quotation marks and separating it from the variables with commas. Like so:

In [None]:
age_1= 34
Response_ID = 1
print('Participant', Response_ID, 'is', age_1, 'years old.')

# Types of Variables

According to the kind of values that make them up, variables can be of different types.

This is important because different kinds of variables will be subject to different rules and operations.

What kind of variables can you think of? (Hint: consider the different kinds of data you might collect in a study)

*(We'll pause here to let you make some guesses)*

The first type of variable we will look at holds **integers**, and we will call them *(int)*.

An integer is a whole number that we can count with. Say, 3. Or 345. Integers allow us to count the number of whole things we have.

Numbers with a fractional part (decimals) are instead called **floating-point numbers** *(float)*.

A floating-point number allows us to measure the dimension of something. For example, my thumb is 59.32 mm long.

Lastly, we have variables made of letters or text and symbols, and we call those string variables or **strings** *(str)*.

Here are some examples of all three types:

In [None]:
Participant_B_Age = 34 # an int variable
Participant_B_Salary = 42.75 # a float variable
Participant_B_Hometown = "Ithaca" # a str variable

Try setting a variable for your name and one for the city where you were born. Then have the notebook print out a sentence saying where you were born.

What type of variable were the name and city?

In [None]:
# Write your solution here


You can use the function *type* to have the notebook tell you what type a variable is.

In [None]:
type(Participant_B_Hometown)

The function *type* can also be used with values rather than variable names:

In [None]:
type(52)
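
As a quick check of the three types introduced above, the sketch below calls *type* on one value of each kind. The variable names are made up for this illustration and are not part of the survey data.

```python
# Illustrative values (names are assumptions for this sketch)
example_count = 3          # a whole number
example_length = 59.32     # a number with decimals
example_city = "Ithaca"    # text

print(type(example_count))   # <class 'int'>
print(type(example_length))  # <class 'float'>
print(type(example_city))    # <class 'str'>
```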

# Calculations with Variables

How can we have the notebook print how old Participant 1 will be in three years?

We already have a variable for their age. So we can create a new one for their age in three years and have it print out.


In [None]:
Later_Age_1 = age_1 + 3
print(Later_Age_1)

This has added an *int variable* with an *int value*.

To multiply we will use the asterisk symbol `*`.

In [None]:
b=3*3
print(b)

This has multiplied two *int values*.

To divide we will use the slash symbol /.

In [None]:
c=5/2
print(c)


In this case we divided two *int* values and obtained a *float* one.

To raise a number to a power (for example, to square it) we will use the double asterisk symbol `**`.

In [None]:
d=3**2
print(d)

Many operations are valid for string values and variables too.

By "adding them" with a '+' we can concatenate them, like blocks into sentences.

I can create a variable made of two string values for my hometown. I make sure to enclose each string value in quotation marks:

In [None]:
Hometown = 'Saint'+'Louis'

print(Hometown)

As you can see, the two words have been assembled one next to the other. 

I may want to add a space here. What kind of value is a space and how can you add it?

In [None]:
type(' ')

In [None]:
Hometown = 'Saint'+' '+'Louis'
print(Hometown)

Numbers written in between ' ' will be treated as string values instead of integer values.

In [None]:
print('Participant', Response_ID, 'is', '34')

What would happen if I want to add a string value to an integer value? 

In [None]:
e='participant'+1
print(e)

The code cell above produces an error (a *TypeError*) because a string value and an integer value cannot be added with '+'.

# Operations Mixing Variable Types

Python has a trick that allows us to use different variable types in the same operation: casting.

Casting a variable or value means converting it to another type. To cast, we first type the name of the type we want, followed by the variable or value in parentheses. For example: *str(1)*.

We saw before that writing a number between quotation marks makes it a string value. If that is true, then adding two numbers in quotation marks should concatenate them:


In [None]:
print('1' + '2')

How could I convert (cast) this string ('2') into an integer to use it in a sum?

In [None]:
print(1 + int('2'))

We can also cast an int value into a string:

In [None]:
print(str(1) + '2')
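
Casting is what makes it possible to build a full sentence with '+'. The sketch below is one way to do this; the variable names are assumptions chosen for the illustration.

```python
# Illustrative values; the names are assumptions for this sketch
example_id = 1
example_age = 34

# str() converts each number to text so that + can join everything
sentence = 'Participant ' + str(example_id) + ' is ' + str(example_age) + ' years old.'
print(sentence)

# int() goes the other way, from text to a number we can do math with
age_from_text = int('34') + 3
print(age_from_text)
```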

Great. Questions so far?

# Functions

Learning Objectives:

*   Understand the concept of a function and syntax of a function
*   View examples of methods

A function is like a button that, when pressed (called), will do an action to a variable.

When we created a variable, we taught the program what value to recall when we write the variable name.

When we create a function, we instead teach the program what action to perform when we write the function name.

To create a function, use the following syntax:
- The keyword "def" (short for "define")
- The name of the function (like variables, can contain letters, numbers, and underscores)
- Parentheses containing zero or more *arguments* (we'll explain what this means shortly)
- A colon after the closing parentheses
- One or more lines of code that are indented. These indented lines of code define the actions that will be performed in the function, and are referred to as the function's *body*.

In [None]:
def increase_age(starting_age):
    later_age = starting_age + 3
    print("The age after 3 years will be", later_age)

The above example defines a function that adds 3 to a starting_age and prints the result in a sentence. Importantly, in this definition, starting_age does not refer to a specific variable or value. Instead, it is a placeholder, which in Python is referred to as an *argument*. The idea is that the function represents a "generic" action that can be done to any variable or value. When you call the function (press the button), you have to tell it what *specific* variable or value to act on. You do this by typing the function name and then putting the variable or value in parentheses:

In [None]:
increase_age(age_1)

Functions can have multiple arguments. The arguments are separated by commas.

In [None]:
def summarize(who, how_old):
    later_age = how_old + 3
    print("Participant " + str(who) + " " + "is " + str(how_old) + " years old, and will be " + str(later_age) + " " + "in three years.")

In [None]:
summarize(Response_ID, age_1)

Functions are also able to produce *results* when they are called. This is done using the "return" keyword, and the result can be saved to a variable. To continue the button analogy, think of this as "when you press the button, something comes out":

In [None]:
def celsius_to_fahrenheit(celsius_temp):
    return (9 / 5) * celsius_temp + 32

todays_temperature_C = 32
# the result returned by celsius_to_fahrenheit can be saved to a variable
todays_temperature_F = celsius_to_fahrenheit(todays_temperature_C)
# we can now use this variable in our code, like any other variable
print("Today's temperature is " + str(todays_temperature_C) + " in celsius, which is " + str(todays_temperature_F) + " in fahrenheit")

Functions that return results are very useful when there are calculations that you expect to do multiple times, such as unit conversion like in the example above. By writing the math only once and saving it in a function, you don't need to write the full calculation every time you need it. You just need to call the function, which is much neater.
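
To illustrate the point about reuse, here is a small sketch that repeats the same function definition and calls it twice with different values. The variable names freezing_F and boiling_F are made up for this example.

```python
def celsius_to_fahrenheit(celsius_temp):
    # same formula as in the cell above
    return (9 / 5) * celsius_temp + 32

# the math is written once, but can be used as many times as we like
freezing_F = celsius_to_fahrenheit(0)
boiling_F = celsius_to_fahrenheit(100)
print(freezing_F, boiling_F)
```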

In the previous examples we created our own functions, but there are also many built-in functions.

We have already used a few: *print(), int(), str(), type()*.

Let's use three other examples of built-in functions: *max()*, *min()*, and *round()*.

We can find the maximum of several integer values with the function *max*.

In [None]:
max(1, 2, 3)

The functions *max* and *min* also work with string values, comparing them in alphabetical order.

In [None]:
min('a', 'b', 'c')

The built-in function *round()* will round a floating number to the number of decimals of your choosing. 

This specification of how many decimals the function must round to is called a *parameter*.

Parameters are included in the parentheses after the function name and are separated by commas:

*function(parameter, parameter)*

To round the floating number 5.389 to 1 decimal, we will do the following:

In [None]:
round(5.389, 1)
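
As a further sketch, changing the second value changes how many decimals are kept, and leaving it out rounds to a whole number. The variable name measurement is an assumption for this example.

```python
measurement = 5.389  # illustrative value

print(round(measurement, 2))  # keep two decimals
print(round(measurement, 1))  # keep one decimal
print(round(measurement))     # no second value: round to a whole number
```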

There are also special functions that *belong to* an object, known as **methods**. You can think of a method as "something that the object can do". Different types of objects have different methods. For example, string objects have a method called "upper" which tells the string to turn itself into uppercase. By contrast, integer objects do not have this method, because integers cannot turn themselves into uppercase.

Methods are called by adding a dot *.* after a variable or value, then typing the method name. So for example, if we have a string variable and want to call the "upper" method:

In [None]:
first_name = 'sam'
first_name_uppercase = first_name.upper() # call the "upper" method of the string variable "first_name"
print(first_name_uppercase)

This method can be used on any string value, not just variables. For example:


In [None]:
'Sam'.upper()

You can find out what a method does by using the function *help*.


In [None]:
help(str.upper)

Find out what these other methods do:

*str.replace*

*str.find*

*str.lower*

In [None]:
help(str.replace)

Great job. Questions so far?



# Lists

Learning Objectives:
- Understand how to create and modify a list, and what a list can and can't do
- Become familiar with common list methods

A list is an ordered, indexable collection of data. Let's say you're doing a study on the following hometowns: Brooklyn, Atlanta, Hampton, Brentwood, and Lexington.


You could put that data into a list.

To create a list, put your desired values inside square brackets [...],
where each value is separated by a comma ,.

In [None]:
Hometown_list = ["Brooklyn", "Atlanta", "Hampton", "Brentwood", "Lexington"]
type(Hometown_list)


# Use an item’s index to fetch it from a list.

- Each value in a list is stored in a particular location.
- Locations are numbered from 0 rather than 1.
- Use the location’s index in square brackets to access the value it contains.

["Brooklyn", "Atlanta", "Hampton", "Brentwood", "Lexington"]

index: 0             1          2          3          4

In [None]:
print('the first item is:', Hometown_list[0])
print('the fourth item is:', Hometown_list[3])

Lists can be indexed from the back using a negative index.

["Brooklyn", "Atlanta", "Hampton", "Brentwood", "Lexington"]

index: -5 -4 -3 -2 -1

In [None]:
print(Hometown_list[-1])
print(Hometown_list[-2])

## "Slice" a list using [ : ]

- We can get multiple items from a list using slicing
- Note that the first index is included, while the second is excluded

In [None]:
print(Hometown_list[1:4])

You can leave the first index blank if you want to start from the beginning of the list. Likewise, you can leave the second index blank if you want to end with the last item in the list.

In [None]:
print(Hometown_list[:4])

In [None]:
print(Hometown_list[2:])
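
The sketch below pulls the slicing rules together on a separate copy of the hometowns, stored under the made-up name towns so it does not interfere with Hometown_list. It also shows that negative indices can be used inside a slice.

```python
# An illustrative list (the name "towns" is an assumption for this sketch)
towns = ["Brooklyn", "Atlanta", "Hampton", "Brentwood", "Lexington"]

middle = towns[1:4]      # indices 1, 2 and 3; index 4 is excluded
last_two = towns[-2:]    # negative indices work in slices too
print(middle)
print(last_two)

# The number of items in a slice is the second index minus the first
print(len(middle) == 4 - 1)
```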

## Use index in a string

The same indexing and slicing syntax can also be used to access individual letters in a string!

In [None]:
Hometown = "New York"
Hometown[1]

Let's get the first eight characters (for "New York", that is the whole string, including the space)

In [None]:
Hometown[0:8]

# Changing values in a list
Lists’ values can be replaced by assigning to specific indices.

In [None]:
Hometown_list[0] = "New York"
print('Hometown List is now:', Hometown_list)

Note, however, that the same does not apply to strings - you cannot change characters in a string with the same syntax. We will explain this more in a later workshop.

In [None]:
Hometown = "New York"
Hometown[0] = 'C'
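
The cell above stops with an error. One possible workaround, sketched here with the made-up variable name city, is to build a *new* string out of a replacement letter and a slice of the old one.

```python
city = "New York"  # illustrative variable

# Strings cannot be edited in place, but we can build a new one
# from a replacement letter plus a slice of the old string
new_city = 'C' + city[1:]
print(new_city)
print(city)  # the original string is unchanged
```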

## Lists have Methods
- Just like strings have methods, lists do too.
- IPython lets us do tab completion after a dot ('.') to see what an object has to offer.

In [None]:
Hometown_list.

If you want to append items to the end of a list, use the append method.

In [None]:
Hometown_list.append("Seattle")
print(Hometown_list)

## Use del to remove items from a list entirely.
- del list_name[index] removes an item from a list and shortens the list.
- Not a function or a method, but a statement in the language.

In [None]:
print("original list was:", Hometown_list)
del Hometown_list[3]
print("the list is now:", Hometown_list)

## Challenge 1: Slice It

Try it yourself! Using the list "Answers" in the cell below, use indexing and slicing to fetch the following:
- The fifth item in the list
- The second to last item in the list
- The first 4 items in the list
- The last 6 items in the list
- A slice containing only ['Kentwood', 'Washington']

In [None]:
Answers = [1,3,8.75,20,6, 'Kentwood', 'Washington', 200, 2, 'Tulsa']


## Challenge 2: Index
I've created a (long) list for you below. Use the .index() method to find out what the index number is for Waldo.

Remember, to get help on how to use a method, use the "help" function.

In [None]:
Wheres_Waldo = ["Anna", "Shad", "Rachel", "Maura", "Jason", "Matt", "Konrad", "Justine", "Sarah", "Laura", \
                "Chelsea", "Nina", "Dierdre", "Julian", "Waldo", "Naniette", "Melissa", "Biz", "Elsa", "Demetria",\
                "Liz", "Olivia", "Will", "Ogi", "Melanie", "Jessica"]
waldo_index = _________________ # fill in the blank
print(waldo_index)

## Challenge 3: Join
Read the help file (or the Python documentation) for join(), a string method.

In [None]:
help(str.join)

Using the join method, concatenate all the values in this list into one string:

In [None]:
letters = ['N', 'a', 's', 'h', 'v', 'i', 'l', 'l', 'e']

In [None]:
result = ______________ # fill in the blank
print(result)

Great Job! Next, we will look into Dictionaries.

## Dictionaries
Learning Objectives:


* What is a dictionary?

* What are the advantages of dictionaries?

* How do I use the content of a dictionary?

* Examining

* Modifying

* Iterating

* Methods


**What is a dictionary?** 

A dictionary is a collection of elements organized in key:value pairs.

In [None]:
subjects_dict = {"Name": "Forough Farrokhzad", 
            "Age": 21,  
            "Response_ID": "1", 
            "Self_Confidence": "Agree Slightly", 
            "surveys": ["DevContext","UseOfSpace","BrainInContext","SocialNutrition"]}

For this reason, dictionaries are also described as collections of key-value pairs.

In this example,

the **keys** are **Name, Age, Response_ID, Self_Confidence** and **surveys**, and everything after each colon is the *value* assigned to that key: *"Forough Farrokhzad"*, *21*, *"1"*, *"Agree Slightly"*, and *["DevContext","UseOfSpace","BrainInContext","SocialNutrition"]*.

Dictionaries are defined with curly brackets holding the key-value pairs, written in the format key:value and separated by commas.

**When should I use dictionaries and when should I use lists?**

If the data you are storing is complex and hierarchical, the dictionary's key / value structure is very helpful. This is the advantage of dictionaries.

Keys must be unique (a dictionary cannot contain two entries with the same key), and they must be of a type that cannot be changed, such as a string or a number.

Values, on the other hand, can be anything, including strings, integers, booleans, lists of them or even other dictionaries.

Let's see an example with different data types: strings, booleans, integers and lists.

In [None]:
Developmental_Context = {
  "City": "Ithaca",
  "Urban": False,
  "Year": 2021,
  "Colleges": ["Cornell University", "Ithaca College"]
}

Here is another example of a dictionary called *valid_dict* containing two other dictionaries *dict_nums* and *dict_ints.*

In [None]:
valid_dict = {'dict_nums':{1:'one', 2:'two', 3:'three'},
             'dict_ints':{'one':1, 'two':2, 'three':3}}

In this case, *'dict_nums'* and *'dict_ints'* are keys of *valid_dict*, and the value stored under each of them is itself a dictionary.

While dictionaries can be values in other dictionaries, they cannot be keys:

In [None]:
invalid_dict = {{1:'one', 2:'two', 3:'three'}:'dict_nums',
             {'one':1, 'two':2, 'three':3}:'dict_ints'}

**Examining a Dictionary**

You can use the function *print* and the methods *.values* and *.keys* to see the content of your dictionary.

Let's use this on our first example:

In [None]:
print(subjects_dict)

In [None]:
print(subjects_dict.keys())

In [None]:
print(subjects_dict.values())

You may want your notebook to show you a specific element from your dictionary. You can use the indexing syntax (square brackets) to look up what value goes with a specific key, like so:

In [None]:
subjects_dict["Self_Confidence"]

**Other methods and functions for Dictionaries**

The method .items can come in handy with dictionaries. What do you think it does?

In [None]:
subjects_dict.items()

If you want to know how many pairs there are in a dictionary, use len:

In [None]:
print(len(subjects_dict))
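
The learning objectives above also mention modifying a dictionary. A minimal sketch of how that can look is given below; the dictionary name participant and its contents are made up for this illustration. Note that assigning to an existing key replaces its value, which is how keys stay unique.

```python
# An illustrative dictionary; names and values are made up
participant = {"Response_ID": "2", "Age": 19}

participant["Age"] = 20              # existing key: the old value is replaced
participant["Hometown"] = "Ithaca"   # new key: a new pair is added
del participant["Response_ID"]       # del removes a pair, as it does with lists

print(participant)
print(len(participant))
```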

# Loops and Conditions
Time

Teaching: 10 min
Exercises: 15 min

Learning Objectives:

- Explain what loops are normally used for.
- Trace the execution of a simple (unnested) loop and correctly state the values of variables in each iteration.
- Write for loops that use the Accumulator pattern to aggregate values.


## A for loop executes commands once for each value in a collection.

Suppose we wanted to print three hometowns: Atlanta, Kentwood, Cincinnati. We could do this by writing three print statements:

In [None]:
print('Atlanta')
print('Kentwood')
print('Cincinnati')

But a better way to do this is by using a for loop over a list:

In [None]:
for hometown in ['Atlanta', 'Kentwood', 'Cincinnati']:
    print(hometown)

## The first line of the for loop must end with a colon, and the body must be indented.
- The colon at the end of the first line signals the start of a block of statements.
- Python uses indentation rather than {} or begin/end to show nesting.
- Any consistent indentation is legal, but almost everyone uses four spaces.

## A for loop is made up of a collection, a loop variable, and a body.

In [None]:
for hometown in ['Atlanta', 'Kentwood', 'Cincinnati']:
    print(hometown)

- The collection, ['Atlanta', 'Kentwood', 'Cincinnati'], is what the loop is being run on.
- The body, print(hometown), specifies what to do for each value in the collection.
- The loop variable, hometown, is what changes for each iteration of the loop, the "current thing".

## The Accumulator pattern turns many values into one.
A common use of loops is to perform some computation on each item in a collection, and then save and/or combine the results in a new variable. This can be achieved using the *accumulator pattern*:
 - Initialize an accumulator variable to zero, the empty string, or the empty list.
 - Use a loop over a collection to perform computations on each value in the collection.
 - In the loop body, update the accumulator variable with the results of the computation.
 
In the below example, we have a list of ages collected in a survey three years ago. We use the accumulator pattern to create a new list representing the subjects' current ages.

In [None]:
survey_ages = [18, 21, 34, 40, 18, 19, 20]

present_ages = []
for age in survey_ages:
    present_ages.append(age+3)

print(present_ages)
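
The accumulator does not have to be a list. As a sketch of the same pattern starting from the empty string, the code below collects the first letter of each town. The list towns and the variable initials are assumptions made up for this example.

```python
towns = ['Atlanta', 'Kentwood', 'Ithaca']  # illustrative list

initials = ''                      # the accumulator starts as the empty string
for town in towns:
    initials = initials + town[0]  # add the first letter of the current town

print(initials)
```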

## Challenge: Mean age
Below, we have another list of ages. Write code to compute the mean age.

HINT: Try using the accumulator pattern to compute the sum.

In [None]:
ages = [34, 52, 21, 42, 18, 36]

# fill in the blanks
accumulator = ______ # use this accumulator to hold the sum of ages. what should be its starting value?
for ________________:
    accumulator = ___________ # HINT: update the accumulator by adding the current age to the running sum
    
# remember that the mean is the sum divided by the total number of items.
# accumulator now contains the sum, so what's the last thing we need to do?
mean = ______________
print(mean)

# Conditionals
Learning Objectives:

- Being more specific with your data: 

  * Correctly write programs that use if and else statements and simple Boolean expressions (without logical operators).
  * Trace the execution of unnested conditionals and conditionals inside loops.

Keypoints:
 - Use if statements to control whether or not a block of code is executed.
 - Conditionals are often used inside loops.
 - Use else to execute a block of code when an if condition is not true.
 - Use elif to specify additional tests.
 - Conditions are tested once, in order.

## Use if statements to control whether or not a block of code is executed.
- An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
- Structure is similar to a for statement:
 - First line opens with if, contains a Boolean variable or expression, and ends with a colon
 - Body containing one or more statements is indented (usually by 4 spaces)

In [None]:
age_1 = 23
if age_1 > 21:
    print('Being ' + str(age_1) + ', participant 1 is of age.')

In [None]:
age_2 = 18
if age_2 < 21:
    print('Being ' + str(age_2) + ', participant 2 is under age.')

# Conditionals are often used inside loops.
- Not much point using a conditional when we know the value (as above).
- But useful when we have a collection to process.

In [None]:
# select values that are underage

ages = [21, 19, 18, 21, 20]
for age in ages:
    if age < 21:
        print(age, 'is in underage group')

## Use else to execute a block of code when an if condition is not true.
- else is always attached to if.
- Allows us to specify an alternative to execute when the if branch isn't taken.

In [None]:
ages = [21, 19, 18, 21, 20]
for age in ages:
    if age < 21:
        print(age, 'is in underage group')
    else:
        print(age, 'is not in underage group')

# Use elif to specify additional tests.
- May want to provide several alternative choices, each with its own test.
- Use elif (short for "else if") and a condition to specify these.
- Always associated with an if.
- Must come before the else (which is the "catch all").

In [None]:
ages = [18, 88, 23, 97, 34]
for age in ages:
    if age > 65:
        print(age, 'is in elders group')
    elif age > 21:
        print(age, 'is in adult group')
    else:
        print(age, 'is in teenagers group')

## Challenge 1: Recoding the variable
In the previous example, we just printed out the name of the group. Try using the accumulator pattern instead to create a new list containing the age groups. (This is known in social science as *recoding* and is commonly used to turn numerical data into categorical data)

In [None]:
ages = [18, 88, 23, 97, 34]
result = ___________
for age in ages:
    if age > 65:
        _____________________
    elif age > 21:
        _____________________
    else:
        _____________________
print(result)

# Challenge 2: String Conditionals
Sometimes we have to separate first names from last names. Let's pick out last names starting with B.

To grab the last name, you can use the split method to separate a string into a list of strings based on spaces. Here is an example:

In [None]:
president="Franklin D. Roosevelt"
name_split=president.split()
print(name_split)

In [None]:
presidents_full = ["George Washington", "John Adams", "Thomas Jefferson", "James Madison", "James Monroe", \
        "John Quincy Adams", "Andrew Jackson", "Martin Van Buren", "William Henry Harrison", "John Tyler", \
        "James K. Polk", "Zachary Taylor", "Millard Fillmore", "Franklin Pierce", "James Buchanan", \
        "Abraham Lincoln", "Andrew Johnson", "Ulysses S. Grant", "Rutherford B. Hayes", "James A. Garfield", \
        "Chester A. Arthur", "Grover Cleveland", "Benjamin Harrison", "Grover Cleveland", "William McKinley", \
        "Theodore Roosevelt", "William Howard Taft", "Woodrow Wilson", "Warren G. Harding", "Calvin Coolidge", \
        "Herbert Hoover", "Franklin D. Roosevelt", "Harry S. Truman", "Dwight D. Eisenhower", "John F. Kennedy", \
        "Lyndon B. Johnson", "Richard Nixon", "Gerald Ford", "Jimmy Carter", "Ronald Reagan", "George H. W. Bush", \
        "Bill Clinton", "George W. Bush", "Barack Obama"]

result = []
for president in presidents_full:
    name_split = _____________ # first, split the name like we demonstrated above
    last_name = _____________  # next, use list indexing to get the last name 
    start_letter = ___________ # use string indexing to get the first letter of the last name
    if _____________:          # fill in the conditional of the if statement to check if start_letter is "B"
        ______________         # add the last name to the accumulator list
        
print(result)

**Iterating with Dictionaries**

When you need to make several changes to your dictionary, loops can help.

For example, you could use a loop to have your notebook print not just one but all of the keys along with their values in your dictionary:

In [None]:
#printpairs
d = {'Response_ID_1': 21, 'Response_ID_2': 19, 'Response_ID_3': 24, 'Response_ID_4': 20}

for key in d.keys():
    print(key, d[key])

Imagine these are the ages of different respondents in your survey. Now let's say you wish to correct all the ages in this dictionary by adding 2 years to each:

In [None]:
#add2years
d = {'Response_ID_1': 21, 'Response_ID_2': 19, 'Response_ID_3': 24, 'Response_ID_4': 20}

for key in d.keys():
    d[key] = (2 + d[key])

print(d)
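
The *.items* method seen earlier offers another way to write such loops, handing over each key together with its value. The following is only a sketch; the dictionary example_ages is made up for this illustration.

```python
# Illustrative ages keyed by response ID
example_ages = {'Response_ID_1': 21, 'Response_ID_2': 19}

# Each pass through the loop receives one key and its matching value
for key, value in example_ages.items():
    print(key, value)
```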

The *in* operator works for both lists and dictionaries.

It allows you to check whether an element is contained within a list or, for a dictionary, whether a key is contained within it.
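
A short sketch of the *in* operator follows. The list towns and the dictionary prices are assumptions made up for this example.

```python
towns = ['Atlanta', 'Kentwood']              # illustrative list
prices = {'apples': 0.49, 'oranges': 0.99}   # illustrative dictionary

print('Atlanta' in towns)        # True
print('Seattle' in towns)        # False
print('apples' in prices)        # True: 'apples' is a key
print(0.49 in prices)            # False: only the keys are checked
print(0.49 in prices.values())   # True: this checks the values instead
```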

## Challenge

Using the dictionary below and a for loop, calculate how much it'll cost you to buy 2 pieces of each fruit.

In [None]:
d = {'apples': 0.49, 'oranges': 0.99, 'pears': 1.49, 'bananas': 0.32}

# Files
Learning Objectives:

* Learn the Python way of reading in files.
* Understand how to read/write text files and csv files.

Example file:

* We are going to use an example dataset retrieved from a lab at Cornell University.


Up until now, we have been creating variables directly in our code to represent our data. In actual social science research, however, you will most often be working with data that either you or someone else collected, which has been saved in a *file*, for example a text file or an Excel spreadsheet.

In fact, most of the variables we have been talking about come from a real database collected in the College of Human Ecology at Cornell University.

It belongs to a social neuroscience project that looks at the relationship between social affordances and brain function, including risk/reward sensitivity and executive control.

We have created a file containing a subset of this database for you with the following columns:
* Year
* Month
* Response_ID
* Hunger_Experience
* Moving_Times
* Self_Confidence
* Hometown
* Age
* Parental_Income
* Race_Ethnicity
* School_Year
* Transfer_Student
* College
* Gender_Identity
* Sex
* Living_Location

We will now use this file to demonstrate how to work with files in Python. Before we get started, it might be helpful to understand the actual survey questions that were used to create this data.


---



The first three items (Year, Month and Response_ID) are filled in automatically. Here are the prompts for the rest of the questions in this survey and their choices:

* **Hunger_Experience:** In the last 12 months, were you ever hungry but didn't eat because there wasn't enough money for food?

> -No

> -Yes

> -I don't know


* **Moving_Times:** How many times did you move before you turned 18? Please respond with a numeral. Do not include moving for college.

* **Self_Confidence:** When things don't go according to my plans, my motto is, "Where there's a will, there's a way."

>-Don't Know/Not Applicable

>-Disagree strongly

>-Disagree somewhat

>-Disagree slightly

>-Agree slightly

>-Agree somewhat

>-Agree Strongly


* **Hometown:** What is the name of your hometown (including city, state/province and zip/postal code)? For example, Ithaca, New York 14850. 

* **Age:** What is your age? (Enter a number)

* **Parental_Income:** What is your parents' income?

>-Below $40,000

>-$40,000 - $59,999

>-$60,000 - $99,999

>-$100,000 - $174,000

>-$175,000 - $299,999

>-$300,000 - $499,999

>-$500,000 - $749,999

>-More than $750,000


* **Race_Ethnicity:**  What is your race/ethnicity?

>-White (of European Descent)

>-Black

>-Hispanic

>-Asian

>-Pacific Islander

>-Middle Eastern

>-Mixed Race

>-Unknown


* **School_Year:** What is your year in school?

>-1st year undergraduate

>-2nd year undergraduate

>-3rd year undergraduate

>-4th year undergraduate

>-Graduate


* **Transfer_Student:** Are you a transfer student?

>-Yes

>-No


* **College:** Which College at Cornell are you in?

>-College of Agriculture and Life Sciences

>-College of Architecture, Art and Planning

>-College of Arts and Sciences

>-SC Johnson College of Business

>-College of Engineering

>-College of Human Ecology

>-School of Industrial and Labor Relations

>-Faculty of Computing and Information Science

>-Cornell Law School

>-College of Veterinary Medicine

>-Weill Cornell Medicine


* **Gender_Identity:** What is your gender identity?

>-Man

>-Woman

>-Gender non conforming

* **Sex:** What is your sex?

>-Female

>-Male

>-Intersex


* **Living_Location:** Where do you live?

>-Off Campus

>-On Campus




# Reading from a file
Reading a file requires three steps:

1. Opening the file: the open function
2. Reading the file: the read method of the file object
3. Closing the file: the close method of the file object

In [None]:
my_file = open('data/survey_responses_small.csv', "r") # open takes two arguments: the location of the file to open, and the mode (see below for details)
text = my_file.read()
my_file.close()

print(text)

However, it is better to use the with open syntax, which closes the file for you automatically as soon as the indented block finishes.
The 'r' indicates that you are reading the file, as opposed to, say, writing to it.

In [None]:
# better code
with open('data/survey_responses_small.csv', 'r') as my_file:
    text = my_file.read()
    
print(text)

#note that we are also reading the column names as the first line and we will deal with this issue later
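To see that the with open syntax really does close the file, here is a minimal sketch. It assumes a small stand-in file (the name demo.csv and its contents are made up for illustration) so that it runs without the survey data. Every file object has a closed attribute that tells us whether the file is still open.

```python
import os
import tempfile

# Hypothetical stand-in file, created only so this sketch runs on its own
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.csv')
with open(demo_path, 'w') as demo_file:
    demo_file.write('Year,Month\n2020,5\n')

with open(demo_path, 'r') as my_file:
    text = my_file.read()
    closed_inside = my_file.closed  # still open inside the indented block

closed_after = my_file.closed       # closed once the block has ended

print(closed_inside)  # False
print(closed_after)   # True
```

With the plain open and close version, forgetting to call close (or hitting an error before reaching it) would leave the file open, which is why the with open form is usually preferred.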

# Reading a file as a list

Very often we want to read in a file line by line, storing those lines as a list.

To do that, we can use a for loop over the file object:

In [None]:
stored = []
with open('data/survey_responses_small.csv', 'r') as my_file:
    for line in my_file:
        stored.append(line)

In [None]:
stored

Remember that the loop variable name can be anything. It does not have to be "line". Whatever you call it, looping over a file object always gives you one line at a time.
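As a small illustration of both points, the sketch below loops with a variable called row instead of line, and compares the result with the file object's built-in readlines() method, which returns all lines as a list in one call. The stand-in file demo.csv and its contents are assumptions made up for this example.

```python
import os
import tempfile

# Hypothetical stand-in file, created only so this sketch runs on its own
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.csv')
with open(demo_path, 'w') as demo_file:
    demo_file.write('Year,Month\n2020,5\n2021,6\n')

# Loop over the file with a differently named variable
looped = []
with open(demo_path, 'r') as my_file:
    for row in my_file:
        looped.append(row)

# readlines() builds the same list in a single call
with open(demo_path, 'r') as my_file:
    all_lines = my_file.readlines()

print(looped)
print(looped == all_lines)  # True
```

The for loop is still the more flexible option, because it lets us do something with each line (such as cleaning or filtering it) before storing it.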

# Line breaks

The list we have just created looks pretty good, but there is something strange about it. You might notice that each string in the list ends with '\n'. Why is this happening? The answer is that '\n' is a special character, a *line break*, that marks the end of each line in a file.

We can use the strip method to get rid of those line breaks at the end.

The strip() method returns a copy of the string with the leading and trailing whitespace characters, such as spaces and line breaks, removed. (If you pass it a string argument, it removes those characters from both ends instead.) Here are some demonstrations:

In [None]:
string = '  White (of European descent)   '
print(string.strip()) # no more extra spaces!

In [None]:
string = 'White (of European descent)\n'
print(string.strip()) # no more line break!
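A few more variations, as a rough sketch: rstrip() and lstrip() only clean one end of the string, and passing an argument tells Python which characters to remove. The example strings are made up for illustration, and repr() is used so that spaces and line breaks stay visible in the printed output.

```python
line = '  Ithaca, New York 14850 \n'

both_ends = line.strip()          # whitespace removed from both ends
only_newline = line.rstrip('\n')  # only the trailing line break removed
only_left = line.lstrip()         # only the leading spaces removed
dashes = '--Yes--'.strip('-')     # the argument picks the characters to remove

print(repr(both_ends))     # 'Ithaca, New York 14850'
print(repr(only_newline))  # '  Ithaca, New York 14850 '
print(repr(only_left))     # 'Ithaca, New York 14850 \n'
print(repr(dashes))        # 'Yes'
```

Note that none of these methods change the original string; each one returns a new string.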

So if we want to get rid of the line breaks, we simply need to call strip() before adding each line to the list:

In [None]:
stored = []
with open('data/survey_responses_small.csv', 'r') as my_file:
    for line in my_file:
        stored.append(line.strip())


In [None]:
stored

# Read certain lines in file

* We can pick certain lines to read using the enumerate function

* We can use this method to drop the title line

* The enumerate function adds a counter to an iterable and returns an enumerate object, which pairs each item with its position (counting from 0 by default). This object can be used directly in for loops or converted into a list of tuples using the list() function.

In [None]:
list1 = ["Time_Stamp","Year","Month"]

for ele in enumerate(list1):
    print(ele)
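The sketch below shows the other uses mentioned above: converting the enumerate object into a list of tuples, unpacking the counter and the item into two loop variables, and (as an optional extra) starting the count at 1 with the start argument.

```python
list1 = ["Time_Stamp", "Year", "Month"]

# Convert the enumerate object into a list of (counter, item) tuples
pairs = list(enumerate(list1))
print(pairs)  # [(0, 'Time_Stamp'), (1, 'Year'), (2, 'Month')]

# Unpack each tuple into two variables inside a for loop
for i, name in enumerate(list1):
    print(i, name)

# The optional start argument changes where the counter begins
pairs_from_one = list(enumerate(list1, start=1))
print(pairs_from_one)  # [(1, 'Time_Stamp'), (2, 'Year'), (3, 'Month')]
```

Unpacking into two variables is the form we will use when reading files, because it gives us the line number and the line itself at the same time.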

In [None]:
stored = []
with open('data/survey_responses_small.csv', 'r') as my_file:
    for i, line in enumerate(my_file):  # use enumerate to count which line we are on
        if i > 0:
            stored.append(line.strip())
        else:
            continue # skip the first line (i == 0), which holds the column names

In [None]:
stored # now the title line has been removed!
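For comparison, here is one alternative way to drop the title line. This is not the approach used in this lesson, just a sketch: the built-in next() function reads a single line from the file object, so calling it once before the loop uses up the header. The stand-in file demo.csv and its contents are assumptions made up for this example.

```python
import os
import tempfile

# Hypothetical stand-in file, created only so this sketch runs on its own
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.csv')
with open(demo_path, 'w') as demo_file:
    demo_file.write('Year,Month\n2020,5\n2021,6\n')

stored = []
with open(demo_path, 'r') as my_file:
    header = next(my_file).strip()  # read the title line and keep it separately
    for line in my_file:            # the loop continues from the second line
        stored.append(line.strip())

print(header)  # Year,Month
print(stored)  # ['2020,5', '2021,6']
```

The enumerate approach is more general, since it also lets us select lines from the middle of a file. Keep in mind that next() raises an error if the file is completely empty.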

# Exercise

Read from line 7 to line 20 (including both line 7 and line 20)

In [None]:
stored = []
with open('data/survey_responses_small.csv', 'r') as my_file:  
    for i, line in enumerate(my_file):
        if _______________: # fill in the blank
            stored.append(line.strip())
        else:
            continue 

In [None]:
print(stored)

# Writing to a file

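We can use the with open syntax for writing files as well, this time passing 'w' (write) as the mode. Note that 'w' creates the file if it does not exist and overwrites it if it does.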

In [None]:
hometowns = ['Pasadena', 'Pensacola', 'Panama']
with open('output.csv', 'w') as new_file:
    for town in hometowns:
        new_file.write(town + '\n')  # add a line break after each hometown

If you look in the file browser, you'll see that the new file has been created! Let's take a look at the file.