## Notebook 2: Basic Python Programming Skills


## 1. Introduction

The goals for today's lecture are to work through some basic features of Python programming and to explore different variable types.  There is SO much to learn, and so many ways of doing the same thing with different code, that we're going to focus on the basics for inferential data analysis.  

### 1.1 The Python Difference

One of the challenging things about using open source technologies is that they are rarely presented as a "complete" software package.  If you're working in Excel, you don't need to go outside of the program to calculate a "square root of the sum of squares", for example.  It's a function in Excel. With open source software, different functions are created by different programmers, and we often have to "call" in that external code to do what we want. These collections of external code are called libraries.

Some important libraries in Python are:

>**numpy**: used for math and logic operations.

>**pandas**: used for the storing and basic handling of data.

>**matplotlib**: used for data visualization, creating plots, graphs, etc.

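>**math**: part of Python's standard library, a collection of basic math functions.

>**seaborn**: built on top of matplotlib, used for statistical plots such as histograms and count plots.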

We load these libraries with the import commands below. The abbreviation after "as" is what we will use to "call" functions that belong to that library.  As we start to get more sophisticated, we'll call ever more libraries into our Python notebooks. 

### NOTE: If you are working in the desktop version of Python, you will have to install the libraries first. 

In [None]:
import numpy as np
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


pd.options.display.float_format = '{:.2f}'.format
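
To make the idea of "calling" a library concrete, here is a minimal sketch of the Excel example from above, the square root of the sum of squares, done two ways. The numbers in `values` are made up for illustration; they are not from our data.

```python
import math
import numpy as np

# Hypothetical values, used only for illustration
values = [3, 4, 12]

# Option 1: the math library from Python's standard library
root_math = math.sqrt(sum(v ** 2 for v in values))

# Option 2: numpy, called through its "np" abbreviation
root_np = np.sqrt(np.sum(np.square(values)))

print(root_math, root_np)
```

Both lines do the same job. The prefix before the dot (`math.` or `np.`) tells Python which library the function lives in.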

### 1.2 Introduction to PANDAS

<img src="panda.jpeg" width="300">

No, not those pandas.  In Python, the name "pandas" is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. With pandas, we can clean, transform, and analyze our data.  

The building blocks of pandas are "series".  A series is a single column of values with an index, and a dataframe is a two-dimensional table made up of a collection of series.

<img src="Series.png" width="300">

In [None]:
#Here, I'm going to use a dictionary to assign values to my "apples" and "oranges" data
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}
print(data)

In [None]:
#now, I'm going to convert them into a pandas dataframe.
df = pd.DataFrame(data)
df

#### Note how Python has indexed my values - you can think of a dataframe as an Excel worksheet.  Just like Excel, it is assigning row numbers to each observation, but in this case, the row numbers start with 0 rather than 1.  
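
As a quick check on the two ideas above (a dataframe is a collection of series, and rows are numbered from 0), here is a small sketch using the same fruit data. It uses `.iloc`, which picks rows by position; treat it as one possible way to peek at a row, not the only one.

```python
import pandas as pd

# Same fruit data as in the cell above
df = pd.DataFrame({
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
})

# Each column of a dataframe is a pandas Series
apples = df['apples']
print(type(apples))

# Row numbers start at 0, so the first row is row 0
first_row = df.iloc[0]
print(first_row)

# The index runs from 0 up to (number of rows - 1)
print(list(df.index))
```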

## 2. Bringing a CSV file into Python

Python can read in many forms of data, but one of the most common is a .csv file ("comma-separated values").  We can easily import .csv data into Python.  The "pd." tells Python to look in the pandas library (this is like vocabulary - something to learn), "read_csv" is the pandas function that reads a csv file, and inside the parentheses we give the name of the file and say that the delimiter is a comma. (A delimiter is what separates the columns, or the values within each row, from one another.)

(We are able to put the name of the CSV file alone as the parameter rather than the full file path since our file is in the same folder as our Python file.)

In [None]:
pd.read_csv('CHISextract2022.csv', delimiter = ',')

Right now, all we've done is read the .csv and display it, but what we want is to store the data in Python under a name so we can manipulate the columns. Similar to an Excel worksheet, pandas calls this table a "dataframe". Programmers often use df to signal that the data are in a dataframe - we're going to follow this convention, and assign our .csv the name "chis_df". 

In [None]:
chis_df = pd.read_csv('CHISextract2022.csv', delimiter = ',')
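
To see what the delimiter does without needing a file on disk, here is a small sketch that reads csv text straight from a string. The contents of `csv_text` are invented for illustration (they are not real CHIS values), and `io.StringIO` is just a stand-in for a file.

```python
import io
import pandas as pd

# Hypothetical file contents, invented for illustration (not real CHIS data)
csv_text = "AE_VEGI,SRSEX\n7,1\n3,2\n10,2\n"

# io.StringIO lets pandas read a string as if it were a file on disk
example_df = pd.read_csv(io.StringIO(csv_text), delimiter=',')
print(example_df)

# The same values separated by semicolons need a different delimiter
semi_text = csv_text.replace(',', ';')
semi_df = pd.read_csv(io.StringIO(semi_text), delimiter=';')

# Both versions produce the same dataframe
print(semi_df.equals(example_df))
```

If the delimiter doesn't match the file, pandas will usually cram everything into a single column, which is a good clue that something is off.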

In [None]:
#let's take a look at our data; by default, head() will give us the first 5 rows
chis_df.head()

In [None]:
#but we can also specify how many rows we want to look at
chis_df.head(10)

In [None]:
#we can get information about our dataset by calling the "info()" function
chis_df.info()
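
Much of what `info()` reports can also be requested piece by piece. Here is a minimal sketch on a tiny made-up dataframe (`example_df` is a hypothetical stand-in for `chis_df`):

```python
import pandas as pd

# Hypothetical stand-in for chis_df, with invented values
example_df = pd.DataFrame({
    'AE_VEGI': [7, 3, 10, 5],
    'SRSEX': [1, 2, 2, 1]
})

# (number of rows, number of columns)
print(example_df.shape)

# The data type pandas assigned to each column
print(example_df.dtypes)

# Number of missing values in each column
print(example_df.isna().sum())
```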

### 2.1 Renaming Columns

Great, now we have our data in Python.  The first step of any project involving disaggregate data is learning a bit more about each of our variables.  It can also be helpful to rename columns, so we don't have to keep referring to the codebook to decode the variable names.  Here is a mini codebook for today's data:


> AE_VEGI: Number of times respondent eats vegetables per week

> SRSEX: Self-reported Sex (1= Male, 2=Female)

> OMBSRR_P1: Race/ethnicity
(1=Hispanic, 2= White NH, 3=Black NH, 4=AmIndian/Alaska Native NH, 5=Asian NH, 6=Other or two or more)

> POVLL: poverty level
(1 = 0-99% FPL, 2=100-199% FPL, 3=200-299% FPL, 4=300% FPL and above)

> AK22_P1: Household Income

> AM184: How Often Worry about Paying Rent/Mortgage
(1=Very often, 2=Somewhat Often, 3=From Time to Time, 4=Almost Never)



#### Helpful programming tip: the longer your variable names, the more likely you'll make a typo, meaning your code won't run.  Python is also CASE SENSITIVE, so whether you capitalize something matters for how Python reads it.  In renaming my columns, I try to keep my names simple and short, and don't include any capital letters.

In [None]:
chis_df.columns

In [None]:
chis_df.rename(columns={"AE_VEGI": "ate_veg",
                        "SRSEX": "sex",
                        "OMBSRR_P1": "race_ethnicity",
                        "POVLL": "pov_cat",
                        "AK22_P1": "hh_inc",
                        "AM184": "housing_worry"}, inplace=True)

In [None]:
chis_df
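
Two things about renaming are worth seeing in a small sketch: what happens if you leave off `inplace=True`, and what case sensitivity means for column names. The dataframe here is a made-up stand-in with two of the original column names.

```python
import pandas as pd

# Hypothetical stand-in for chis_df, with two of the original column names
example_df = pd.DataFrame({'AE_VEGI': [7, 3], 'SRSEX': [1, 2]})

# Without inplace=True, rename() returns a new dataframe
# and leaves the original one unchanged
renamed_df = example_df.rename(columns={'AE_VEGI': 'ate_veg', 'SRSEX': 'sex'})
print(list(example_df.columns))
print(list(renamed_df.columns))

# Column names are case sensitive: 'sex' exists, 'Sex' does not
print('sex' in renamed_df.columns)
print('Sex' in renamed_df.columns)
```

Either style works. Assigning the result to a new name keeps the original columns around in case you need to go back to the codebook.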

## 3. Selecting Rows and Columns from your Dataframe

One thing I found very confusing about the Python programming language and pandas was the use of brackets and "selecting" data using slicing.  Let's look at some different ways of selecting rows and columns.  We can think of this as similar to selecting which columns or rows we want to work with in Excel.

In [None]:
#In Python, when you want to select a column or row, you are going to use [] 
#to indicate your selection.

print(chis_df['ate_veg'])

In [None]:
#Note, however, that this just returns a single series, not a dataframe - 
#you can think of it as if you took a column out of Excel and pasted it as a string of numbers into a Word document
type(chis_df['ate_veg'])

As you can see, it is no longer a dataframe, or "worksheet" - it's just a single series of numbers.  If what I really want to do is create an extract (e.g., copy the column into a new worksheet), I need two sets of brackets: the outer brackets select from the dataframe, and the inner brackets hold a list of the columns I want.  Passing a list, even a list with only one column in it, returns a dataframe.

In [None]:
veg_df = chis_df[['ate_veg']]

In [None]:
#now I have a new dataframe with just the variable ate veggies in it
veg_df
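
Here is the single-versus-double bracket difference side by side, as a minimal sketch on a small made-up dataframe (`example_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for chis_df, with invented values
example_df = pd.DataFrame({'ate_veg': [7, 3, 10], 'sex': [1, 2, 2]})

# Single brackets with one column name -> a Series
one_column_series = example_df['ate_veg']
print(type(one_column_series))

# Double brackets (a list of column names inside the selection) -> a DataFrame
one_column_frame = example_df[['ate_veg']]
print(type(one_column_frame))

# The inner list can be built first, which makes the two steps easier to see
cols = ['ate_veg', 'sex']
two_columns = example_df[cols]
print(two_columns)
```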

In [None]:
#we're going to create a new dataframe with just our variables for today
chis_df_small = chis_df[['ate_veg', 'sex', 'race_ethnicity', 'pov_cat', 'hh_inc', 'housing_worry']]

In [None]:
chis_df_small
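
So far we've only selected columns. Rows can be selected too. The sketch below shows two common approaches, by position with `.iloc` and by a condition, on a made-up dataframe; it is one way to do it, assuming the sex coding from our codebook (2 = Female).

```python
import pandas as pd

# Hypothetical stand-in for chis_df_small, with invented values
example_df = pd.DataFrame({'ate_veg': [7, 3, 10], 'sex': [1, 2, 2]})

# Select rows by position: rows 0 and 1 (the end of the slice, 2, is not included)
first_two = example_df.iloc[0:2]
print(first_two)

# Select rows by condition: keep only rows where sex equals 2 (Female)
women = example_df[example_df['sex'] == 2]
print(women)
```

Notice that the filtered dataframe keeps the original row numbers (1 and 2), just like filtered rows in Excel keep their row numbers.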

## 4. Exploring Variables

Okay!  Now let's start exploring each of the variables in the dataset. A nice way to begin is to use the "describe" function to look at the distribution of our variables.  Let's take a look!

In [None]:
chis_df_small.describe()

### 4.1 Numeric variables

Numeric variables are variables whose values are meaningful numbers, either integers (1, 4, 300) or floats (1.6, 4.56, 300.1543). When we work with raw numeric data, we want to explore their "distribution" - what is the mean and standard deviation?  What is the smallest value?  What is the largest value?

In [None]:
#Just as with the describe function above, we can ask to describe a single variable
chis_df_small['ate_veg'].describe()
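
Each number that `describe()` reports can also be asked for on its own. Here is a minimal sketch on a made-up series (the values in `veg` are invented, not CHIS data):

```python
import pandas as pd

# Hypothetical values for times vegetables were eaten per week
veg = pd.Series([7, 3, 10, 5, 0], name='ate_veg')

# The individual pieces of describe()
print(veg.mean())    # average
print(veg.std())     # sample standard deviation (divides by n - 1)
print(veg.min())     # smallest value
print(veg.max())     # largest value
print(veg.median())  # middle value, the same as the 50% row in describe()
```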

In [None]:
#I really like looking at the distribution of my variable visually - it helps me see what's going on. 
#Histograms are a powerful way of assessing the distribution of a numeric variable
#To figure out all the options for making a histogram, I will rely on online information about how to make a histogram
#https://seaborn.pydata.org/generated/seaborn.histplot.html
sns.histplot(data=chis_df_small, x="ate_veg", binwidth=5)
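
If you're wondering what a bin width of 5 actually does, here is a rough sketch of the counting behind a histogram, using numpy on made-up values. Seaborn chooses its own bin edges based on the data, so treat this as the idea behind the bars rather than an exact reproduction of the plot.

```python
import numpy as np

# Hypothetical values, invented for illustration
veg = np.array([0, 3, 5, 7, 7, 10, 14, 21])

# Bin edges every 5 units: 0, 5, 10, 15, 20, 25
bin_edges = np.arange(0, 30, 5)

# Count how many values fall between each pair of edges
counts, edges = np.histogram(veg, bins=bin_edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {n}")
```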

In [None]:
#because there aren't too many discrete values, we can also just plot the full range of answers
sns.countplot(data=chis_df_small, x='ate_veg')

In [None]:
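#with this many bars the plot gets crowded, so here we make the figure wider (15 x 5 inches)
#pyplot is the part of matplotlib that controls the figure itself
import matplotlib.pyplot as pyplt
pyplt.figure(figsize=(15,5))
sns.countplot(x="ate_veg",data=chis_df_small)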

### 4.2 Nominal Binary variables

In addition to numeric variables, we also often have to work with "nominal" variables (those with a "Name").  Note that in the CHIS data, the variables that are nominal (e.g. sex, race/ethnicity) are actually assigned number values rather than strings.   

Let's start with looking at the "sex" variable.  It has two possible values, "Male" and "Female".  This is known as a binary or dichotomous variable.  But, even though they are represented by the numbers 1 and 2, we can't treat them as numbers - e.g., adding 2 males together doesn't give us a female.

    SRSEX: Self-reported Sex (1= Male, 2=Female)


In [None]:
chis_df_small[["sex"]].head(10)

In [None]:
#A simple way to look at the distribution of a binary variable is to request the value_counts()
chis_df_small[['sex']].value_counts()
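
Since 1 and 2 are just labels, it can help to swap them for the names in the codebook before counting. Here is one way to do that with `map()`, sketched on made-up codes (the values in `sex_codes` are invented):

```python
import pandas as pd

# Hypothetical sex codes, using the codebook's coding (1 = Male, 2 = Female)
sex_codes = pd.Series([1, 2, 2, 1, 2], name='sex')

# Replace each code with its label from the codebook
sex_labels = sex_codes.map({1: 'Male', 2: 'Female'})

# Counting the labels is easier to read than counting 1s and 2s
counts = sex_labels.value_counts()
print(counts)
```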

In [None]:
#another approach is to use the "crosstab" function available in pandas.  We'll be using crosstabs a lot when we do t-tests,
#so let's look at a simple example for now
pd.crosstab(index=chis_df_small['sex'], columns="Total")

In [None]:
#let's get the shares: normalize=True gives proportions between 0 and 1 (multiply by 100 for percents)
pd.crosstab(index=chis_df_small['sex'], columns="Total", normalize=True)
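
To turn those proportions into percents, multiply the crosstab by 100. A minimal sketch on made-up data (`example_df` is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical stand-in for chis_df_small, with invented values
example_df = pd.DataFrame({'sex': [1, 2, 2, 1, 2]})

# Proportions between 0 and 1
shares = pd.crosstab(index=example_df['sex'], columns='Total', normalize=True)

# Percents between 0 and 100, rounded to one decimal place
percents = (shares * 100).round(1)
print(percents)

# value_counts() can also return proportions directly
print(example_df['sex'].value_counts(normalize=True))
```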

In [None]:
#we can also use the plot function above to look at the distribution visually
sns.countplot(data=chis_df_small, x='sex')

### 4.3 Categorical variables

A more complicated type of "nominal" variable is one where we have more than 2 categories - we find these all the time in planning surveys!  (And most are ordinal, meaning that the numbers assigned move either up or down in some logical way.)

    AM184: How Often Worry about Paying Rent/Mortgage (1=Very often, 2=Somewhat Often, 3=From Time to Time, 4=Almost Never)

In [None]:
pd.crosstab(index=chis_df_small["housing_worry"], columns="Total")

In [None]:
sns.countplot(data=chis_df_small, x='housing_worry')
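
Because this variable is ordinal, we can tell pandas about both the labels and their order. The sketch below is one possible approach using an ordered categorical; the codes in `worry_codes` are invented, and the labels come from the codebook entry above.

```python
import pandas as pd

# Hypothetical housing worry codes, invented for illustration
worry_codes = pd.Series([1, 4, 3, 4, 2, 4], name='housing_worry')

# Labels from the codebook, listed in their logical order
labels = {
    1: 'Very often',
    2: 'Somewhat Often',
    3: 'From Time to Time',
    4: 'Almost Never'
}

# Swap codes for labels, then tell pandas the categories have an order
worry_labels = worry_codes.map(labels)
worry_cat = pd.Categorical(worry_labels,
                           categories=list(labels.values()),
                           ordered=True)

# sort_index() lists the counts in category order rather than by size
counts = pd.Series(worry_cat).value_counts().sort_index()
print(counts)
```

Keeping the order means tables and plots list the answers from "Very often" to "Almost Never" instead of alphabetically or by frequency.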

In [None]:
# Great work!!!  Explore some of the other variables on your own!