# Introduction to Python for Data Science

ASA DataFest 2024\
Southern Methodist University, Dallas, TX\
Instructor: Connor Brubaker\
Department of Statistics\
Texas A&M University

Installing and interacting with Python, basics of the Python language, essential Python data structures, elementary data wrangling, and basic data visualization.

**Follow the instructions on the home page of the GitHub repository this notebook is hosted at to install miniconda and required packages including Jupyter Lab.**

## Loading required packages

In Python, packages are loaded using the `import` statement. Once a package is imported, you have access to the functions and modules it contains.

In [None]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

So now if you want to access the function `mean` contained in the numpy package, for example, you can type `np.mean()`.

In [None]:
x = [1, 2, 3, 4, 5]
np.mean(x)

## Basic Python data structures

There are three important data structures in base Python:

1. **List** - variable-length, mutable sequence of objects
2. **Tuple** - fixed-length, immutable sequence of objects
3. **Dictionary** - collection of key-value pairs

### Lists

Lists are mutable (can change) collections of objects. The objects of a list can be different types. Lists are instantiated using square brackets `[]`.

In [None]:
colors = ['red', 'green', 'blue']
print(colors)

In [None]:
a_list = [24, 'hours', 7, 'days']
print(a_list)

Python allows you to iterate over items of a list.

In [None]:
for item in colors:
    print(item)

You can access items in a list using square brackets `[]`:

In [None]:
colors[0]

Items are added to a list using either the `append` or `insert` methods.

In [None]:
print(colors)
colors.append('yellow')
print(colors)
colors.insert(2, 'purple')
print(colors)

Concatenation is done using the `+` operator.

In [None]:
new_list = [1, 2, 3] + [4, 5, 6]
print(new_list)

Nested lists are important for matrix-based operations common in data science.

In [None]:
X = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]
for row in X:
    print(row)

In [None]:
Y = [[-3, -3, -3, -3], [-2, -2, -2, -2], [-1, -1, -1, -1]]
result = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]] # placeholder
for i in range(len(X)):         # loop over the rows
    for j in range(len(X[i])):  # loop over the columns of row i
        result[i][j] = X[i][j] + Y[i][j]
        
print(result)
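
The nested loops above spell out element-wise addition by hand. As a rough sketch of the alternative, the same sum can be written with NumPy arrays, where `+` already works element by element. The values are the same `X` and `Y` used above.

```python
import numpy as np

# Same matrices as above, stored as 2-D NumPy arrays
X = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
Y = np.array([[-3, -3, -3, -3], [-2, -2, -2, -2], [-1, -1, -1, -1]])

# The + operator adds arrays element by element, so no loops are needed
result = X + Y
print(result)
print(result.shape)  # (rows, columns)
```

Note that `+` on plain lists concatenates them, while `+` on NumPy arrays adds the entries.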

### Tuples

Tuples are similar to lists, but they cannot be modified after they are created.

In [None]:
# create tuple
tup = (1, 2, 3) # tuples are created using parentheses
tup

In [None]:
# access the first element
tup[0] 

In [None]:
# tuples cannot be modified: this line raises a TypeError
tup[1] = 3
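
To see the immutability without stopping the notebook on an error, one option is to catch the exception. The sketch below also shows one way to build a new tuple with a changed value; the name `new_tup` is only for illustration.

```python
tup = (1, 2, 3)

try:
    tup[1] = 3  # item assignment is not supported for tuples
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# The original tuple is unchanged
print(tup)

# To "change" a tuple, build a new one from slices of the old one
new_tup = tup[:1] + (3,) + tup[2:]
print(new_tup)
```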

### Dictionaries

Dictionaries consist of a collection of key-value pairs. These will be important when discussing DataFrames later on.

In [None]:
# create a dictionary with three key-value pairs
# separated by :
dict1 = {"a": "a string", 
         "b": [1, 2, 3, 4], 
         "c": (3, 4, 5)}
print(dict1)

In [None]:
# get item in dictionary by key
print(dict1["a"])

In [None]:
# get the 2nd element of the list with key 'b'
print(dict1["b"])
print(dict1["b"][1])

In [None]:
# get the keys of the dictionary
print(dict1.keys())

In [None]:
# get the values of the dictionary
print(dict1.values())
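
Keys and values can also be visited together. The following is a small sketch using the same `dict1` as above; the key `"z"` and the variable `missing` are made up for the example.

```python
dict1 = {"a": "a string", "b": [1, 2, 3, 4], "c": (3, 4, 5)}

# items() yields (key, value) pairs, which a for loop can unpack
for key, value in dict1.items():
    print(key, "->", value)

# get() returns a default instead of raising a KeyError for a missing key
missing = dict1.get("z", "not found")
print(missing)
```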

In [None]:
# if ever in doubt about an object's type, use the type() function
# (print each result, since a cell only displays its last expression)
print(type(tup))
print(type(a_list))
print(type(dict1))

### Exercise 1: Creating a dictionary

Start by creating an empty dictionary. Then add three keys and assign each one a value, following the template below.

In [None]:
email = {} # create empty dictionary
email['subject'] = 'Your car\'s warranty is about to expire!'
email['from'] = 
email['to'] = 
print(email)

Create a new key called `cc` (carbon copy) and assign it a list of 3 fake email addresses.

In [None]:
# create the new field on the line below

print(email)

## Booleans and comparison operators

A boolean variable is one that can have only two possible values: `True` or `False`. Comparison operators are important for filtering data sets. Later, we will see that you can actually compare all values in the column of a data frame to a single value to perform filtering. Here are the Python comparison operators:

* `<` - less than
* `<=` - less than or equal to
* `>` - greater than
* `>=` - greater than or equal to
* `==` - equal to
* `!=` - not equal to

In [None]:
x = [5, 6, 7]
x[0] < 6

In [None]:
x[2] > 10

In [None]:
[3, 6, 8] == [3, 6, 8]

In [None]:
[2, 9, 4] == [2, 9, 5]
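
A few more cases may help here. The sketch below reuses the list `x` from above; the names `is_small` and `flags` are only for illustration. It stores a comparison result, compares every item to a single value with a list comprehension (a compact form of a `for` loop), and shows that list equality depends on order.

```python
x = [5, 6, 7]

# A comparison evaluates to a bool that can be stored like any other value
is_small = x[0] < 6
print(is_small, type(is_small))

# Compare each item to a single value, one item at a time
flags = [value > 5 for value in x]
print(flags)

# Lists are equal only if they hold the same items in the same order
print([3, 6, 8] == [8, 6, 3])
```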

## Introduction to pandas

The `pandas` package has many functions for reading, manipulating, and wrangling data. There are two fundamental structures implemented in `pandas`:

* **Series** - a one-dimensional array-like object with an associated array of data labels called its index.
* **DataFrame** - a rectangular table of data; it can be thought of as a collection of Series objects with a common index.

In [None]:
s = pd.Series([0, 3, 4, 6]) # with the default index
print(s)

In [None]:
# like with tuples and lists, you can access elements by integer index using []
print(s[1])

In [None]:
# create a series object with a custom index
s = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])
print(s)

In [None]:
print(s['b'])

In [None]:
# data frames can be created from dictionaries
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data)
print(df)

In [None]:
print(df.index)

In [None]:
# column names can be accessed through the columns attribute
print(df.columns)

In [None]:
# you can modify column names
df.columns = ["STATE", "YEAR", "POP"]
print(df.columns)

In [None]:
# print the first 3 rows
df.head(3)

In [None]:
# print last 3 rows
df.tail(3)

In [None]:
# access an entire column
print(df['STATE'])

In [None]:
# a column is a Series
type(df['STATE'])

In [None]:
# access specific element of a column
print(df['STATE'][1])

## Reading data into a `DataFrame`

The `pandas` package contains a variety of functions for reading data into a `DataFrame` from a file depending on the file type. The most common format is the comma separated values (CSV) format for which the function `read_csv()` is used.

The data set we will work with here is a collection of data on hurricanes that made landfall in the mainland United States between 1950 and 2012.

In [None]:
hurricane = pd.read_csv("data/hurricane.csv")

In [None]:
# preview the data
hurricane.head()

In [None]:
# get the number of rows and variables
hurricane.shape

### Exercise 2: Working with Data Frames

Use the `type` function to determine the type of `hurricane` and verify it is a `DataFrame` object. Then go to the documentation on the `nunique` function: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html
and use it to determine the number of unique elements in each column. What value do you have to specify for `axis`?

## Descriptive statistics

The `DataFrame` object has a variety of methods for computing descriptive and summary statistics. One way to compute several of them at once is the `describe` method. A concise summary of the available methods is found here: https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats

In [None]:
# notice that it only summarizes numerical columns
hurricane.describe()

We can also compute correlations and covariances within the data frame quite easily. For example, if we want to compute the correlation between two specific columns, we do the following.

In [None]:
# select the WindSpeed column then call the corr method 
# to compute its correlation with the Pressure column
hurricane['WindSpeed'].corr(hurricane['Pressure'])

In [None]:
# same thing but to get correlation between WindSpeed and 
# Damage2014 (the USD value of property damage in 2014 dollars)
hurricane['WindSpeed'].corr(hurricane['Damage2014'])

In [None]:
# you can even compute the pairwise correlations between all
# numerical columns at once
hurricane.corr(numeric_only=True)

In [None]:
# sometimes we need to know the unique values of a column as well
# as the number of times they show up 
unique_years = hurricane['Year'].unique()
unique_years

In [None]:
# how many unique years show up in the data?
len(unique_years)

In [None]:
# how many hurricanes in each year?
# only show the first few years
hurricane['Year'].value_counts().head()

## Grouping

Sometimes we are interested in computing summary statistics within levels of a factor or categorical variable. The `groupby` method of the `DataFrame` lets us do this.

In [None]:
# compute the total damage done by hurricanes in each year
hurricane.groupby('Year')['Damage2014'].sum()

In [None]:
# compute the total deaths associated with hurricanes in each year
hurricane.groupby('Year')['Deaths'].sum()
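
To make the split-then-summarize idea easier to check by hand, here is a minimal sketch on a tiny table. The data frame `toy` and its values are made up for illustration and are not taken from `hurricane.csv`. The `agg` method computes several summaries per group in one call.

```python
import pandas as pd

# Made-up values for illustration; these are not from hurricane.csv
toy = pd.DataFrame({
    "Year": [2001, 2001, 2002, 2002, 2002],
    "Deaths": [2, 5, 0, 1, 3],
})

# Split the rows by Year, then summarize Deaths within each group
summary = toy.groupby("Year")["Deaths"].agg(["count", "sum", "max"])
print(summary)
```

Each row of `summary` corresponds to one year, and each column to one of the requested statistics.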

### Exercise 3: Grouping by year

Use the `groupby` method to determine the average wind speed of all hurricanes in each year.

In [None]:
# compute average WindSpeed of hurricanes in each year?


## Filtering rows

Sometimes we want to obtain a subset of the data such that the values of a particular column are within a certain range or satisfy some other criterion. This is achieved by indexing the data frame with a boolean condition, also called filtering.

In [None]:
hurricane['Year'] > 2000

In [None]:
hurricane_after2000 = hurricane[hurricane['Year'] > 2000]
hurricane_after2000.head()

In [None]:
hurricane_bw = hurricane[hurricane['Year'].between(1990, 2000)]
hurricane_bw.head()

The `loc` (label-based) and `iloc` (integer location) indexers are used to subset rows and columns at the same time.

In [None]:
deaths_after2000 = hurricane.loc[hurricane['Year'] > 2000, ['Year', 'Deaths']]
deaths_after2000.head()

In [None]:
hurricane.iloc[0, [0, 1]] # first row and first and second columns
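
Conditions can also be combined. The sketch below uses a small made-up table (`toy`, with invented names and numbers, not the hurricane data) to show two conditions joined with `&` inside `loc`, followed by a positional slice with `iloc`.

```python
import pandas as pd

# Made-up values for illustration; these are not from hurricane.csv
toy = pd.DataFrame({
    "Name": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Year": [1995, 2001, 2005, 2010],
    "Deaths": [3, 0, 12, 1],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
both = (toy["Year"] > 2000) & (toy["Deaths"] > 0)
recent_deadly = toy.loc[both, ["Name", "Deaths"]]
print(recent_deadly)

# iloc selects by position instead: first two rows, first two columns
corner = toy.iloc[0:2, 0:2]
print(corner)
```

The parentheses matter because `&` binds more tightly than `>` in Python.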

### Exercise 4: Subsetting

Use the `loc` indexer to get just the name, year, and damage caused by hurricanes whose wind speed was at least 100 mph.

### Exercise 4(b): Subsetting using `startswith` (Bonus)

Use the `startswith` method in the same way as the `between` method above to get the subset of the data containing only hurricanes whose name begins with "B" (note the hurricane names are capitalized). String methods on a column are reached through its `.str` accessor, so first write `hurricane['Name'].str` and THEN call `startswith`.

In [None]:
hurricane[hurricane['Name'].str.startswith('B')]

## Joining Data Frames

Joining data frames is a good way to combine information. When joining two data frames, you need to join them on a column they have in common. This will be called the `ID` column. In this example, we will combine the `hurricane` data with yearly data on the ocean surface temperatures. The `Year` column will be the common `ID` column that we will join on.

Begin by defining the table on the left and the table on the right. There are four ways of joining two data frames:

* **INNER JOIN** - returns the rows where there is a match in the `ID` column of both tables.
* **LEFT JOIN** - this will return all rows in the left table even if there are no matches in the right table.
* **RIGHT JOIN** - this will return all rows in the right table even if there are no matches in the left table.
* **OUTER JOIN** - think of this as combining the results of the left and right joins. This will return all rows from both tables, filling in `NaN` where there is no match.

Joining data frames is achieved using the `merge` function: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
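
The four join types are easiest to compare on tables small enough to check by eye. The following is a sketch with two made-up data frames, `left` and `right`, that share a `Year` column; the numbers are invented for illustration.

```python
import pandas as pd

# Made-up tables for illustration; Year is the shared ID column
left = pd.DataFrame({"Year": [2000, 2001, 2002], "Storms": [3, 1, 4]})
right = pd.DataFrame({"Year": [2001, 2002, 2003], "Temp": [27.1, 27.4, 27.3]})

# Run each join type and record how many rows come back
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    joined = pd.merge(left, right, on="Year", how=how)
    row_counts[how] = len(joined)
    print(how, "->", len(joined), "rows")

# Rows without a match in the other table get NaN in the missing columns
outer = pd.merge(left, right, on="Year", how="outer")
print(outer)
```

Only 2001 and 2002 appear in both tables, so the inner join keeps two rows while the outer join keeps all four years.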

In [None]:
temps = pd.read_csv("data/temps.csv")
temps.head()

In [None]:
print(temps['Year'].unique())

In [None]:
print(hurricane['Year'].unique())

In [None]:
# perform an inner join to keep only the years present in both data frames
# here hurricane is the left table
pd.merge(hurricane, temps, on='Year', how='inner')

In [None]:
# perform a right join to keep all the temps even though some of those years
# are not in the hurricane data frame
pd.merge(hurricane, temps, on='Year', how='right').head(25)

### Exercise 5: Joins

Using `hurricane` as the left table and `temps` as the right table, what would a left join do? Would this be equivalent to an outer join in this scenario? You might want to consult the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

## Basic Data Visualizations

We will cover creating `figure` objects in `matplotlib`. You can think of a `figure` as the canvas for the visualization you want to make. We will look at creating histograms and scatter plots.

In [None]:
from matplotlib import pyplot as plt

Figures are created using the `figure` function. After adding the plots to the figure, use the `show` function to display it.

In [None]:
plt.figure()
plt.show() # empty right now

Here is the full list of options available for a histogram: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html

The most important ones are `bins`, `density`, and `color`.

In [None]:
plt.hist(hurricane['Damage2014']);
plt.show()

In [None]:
plt.figure()
plt.hist(hurricane['Damage2014'], bins=30, color='red', density=True);
plt.show()

Chart labels are added using `xlabel` and `ylabel`. Titles are added using `title`.

In [None]:
plt.figure()
plt.hist(hurricane['Damage2014'], bins=30, color='red', density=True);
plt.xlabel('Damages')
plt.ylabel('Density')
plt.title('Damages in USD (relative to 2014)')
plt.show()

Scatter plots are made in a similar fashion using the `scatter` function: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

The most important options here are `s` (marker size), `c` (color), and `marker` (the style of the marker).

In [None]:
plt.figure()
plt.scatter(x='WindSpeed', y='Damage2014', data=hurricane, c='green', marker='+')
plt.xlabel('Wind Speed (mph)')
plt.ylabel('Damages (2014 USD)')
plt.title('Damages against Wind Speed')
plt.show()

Matplotlib offers an easy way to combine plots using the `subplots` function, which creates and returns a `figure` and a set of `axes` objects to add plots to.

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
ax[0].hist(hurricane['Damage2014'], bins=30, color='red', density=True);
ax[0].set_xlabel('Damages')
ax[0].set_ylabel('Density')
ax[0].set_title('Damages in USD (relative to 2014)')

ax[1].scatter(x='WindSpeed', y='Damage2014', data=hurricane, c='green', marker='+')
ax[1].set_xlabel('Wind Speed (mph)')
ax[1].set_ylabel('Damages (2014 USD)')
ax[1].set_title('Damages against Wind Speed')
plt.show()

### Exercise 6: Time series plot

Time series plots can be made using the `plot` function with the time variable on the x-axis. Start by creating a new figure with two subplots arranged side by side using `plt.subplots()`. Then plot the wind speed over the years on the left and damages over the years on the right.