<div align="center">
<h1>
Python for Social Science Workshop - Lesson 2
</h1>
</div>
<br />
<div align="center">
<h3>
Jose J Alcocer
</h3>
</div>
<br />
<div align="center">
<h4>
April 11, 2023
</h4>
</div>

****

# Working With NumPy, Pandas, Matplotlib, and Seaborn <br>

<br>

## 1.0 Setting up the environment <br>
This lesson will be dense, as we will cover the most basic libraries and functions that are building blocks of the work we do as social scientists. Let's begin this lesson by importing the proper libraries we will be using throughout the document.

In [None]:
import numpy as np
import pandas as pd
import random

## 1.1 Loops and Conditional Statements <br>

Like R, Python supports several types of loops and conditional statements, which allow us to automate tasks and write more efficient code. <br>

<br>

### 1.1.1 For Loops <br>

With loops, it is possible to iterate over lists or a range of numbers alike to automate a task. We can see a few of these examples below.

In [None]:
# For loop using a list we create within the loop
for x in ['Apples','Bananas','Oranges','Pears','Pineapples','Mangoes','Grapefruits','Cantaloupes']:
    print('I like to eat '+x +'.')

In [None]:
# For loop using a list that already exists
list1 = list(random.sample(range(1,100),50))

# If you do not specify print here, the loop will still perform, but you will not see an output
for i in list1:
    print(i*2)

In [None]:
# For loop using a range in order to add values to a list

# Creating a list
list2 = []

# For loop that says for every i from 0 to 9, multiply each value by 10 and append it to the list created. This is similar to the code used in R where you name the object and place a [i] next to the name of it
for i in range(0,10):
    list2.append(i*10)

print(list2)
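
The same list can also be built with a list comprehension, which is a common shorthand for simple loops like this one. The following is a minimal sketch; the name `list2_comprehension` is introduced here only for illustration.

```python
# Build the list with an explicit loop, as in the cell above
list2 = []
for i in range(0, 10):
    list2.append(i * 10)

# Equivalent list comprehension (illustrative name, not part of the lesson)
list2_comprehension = [i * 10 for i in range(0, 10)]

print(list2_comprehension)
print(list2 == list2_comprehension)
```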

### 1.1.2 While Loops <br>

Like in R, while loops keep running as long as the condition you state is true and stop once it is no longer met. They are useful when you do not know in advance how many iterations you need.

In [None]:
# Using a while loop to append a list

# Creating an object
x = 5
# Creating a list
y = []

# Telling python while the object is smaller than 50, append the object, x*10, to the list and add 5 to the initial object each time this task is successfully completed
while x < 50:
    y.append(x*10)
    print(x)
    x=x+5

print(y)

### 1.1.3 If and Else Statements <br>

If-Else statements are conditional statements that tell Python to run different code under different conditions. If the first condition is met, then do one task; if not, do another.

In [None]:
# Basic If-Else statement
x = 0

if x!=0:
    print(1/x)
else:
    print('No reciprocal for 0.')

In [None]:
# Basic If-Else statement with different x value
x = 2

if x!=0:
    print(1/x)
else:
    print('No reciprocal for 0.')

### 1.1.4 Elif (Else if) in If-Else Statements <br>

'Elif' is an additional clause you can add to an if-else statement. It lets you test more conditions and carry out more tasks when two branches are not enough.

In [None]:
# Creating an object to represent Democratic presidential vote in a general election
district_vote = 53

if district_vote in range(40,61):
    print('This district is competitive')
elif district_vote in range(0,40):
    print('This is a safe Republican district')
else: print('This is a safe Democratic district')


In [None]:
# Creating an object to represent Democratic presidential vote in a general election
district_vote = 61

if district_vote in range(40,61):
    print('This district is competitive')
elif district_vote in range(0,40):
    print('This is a safe Republican district')
else: print('This is a safe Democratic district')
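
One caveat with `in range(...)` is that a range only contains whole numbers, so a vote share such as 53.5 would fall through to the `else` branch. A possible alternative, sketched below, uses comparison operators instead. The function name `classify_district` is hypothetical and only wraps the same three categories used above.

```python
def classify_district(district_vote):
    """Hypothetical helper: classify a district by Democratic vote share (0-100)."""
    if 40 <= district_vote <= 60:
        return 'This district is competitive'
    elif district_vote < 40:
        return 'This is a safe Republican district'
    else:
        return 'This is a safe Democratic district'

print(classify_district(53))
print(classify_district(53.5))

# range() membership fails for values that are not whole numbers
print(53.5 in range(40, 61))
```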

It's also possible to use nested loops and conditional statements in Python. Given time constraints, here is just one example, but you can follow this [link](https://pynative.com/python-nested-loops/) to learn more about how to create them.

In [None]:
# Using an else clause with a while loop
counter = 0

while counter < 10: # part of while loop
    # this check never succeeds here: the while condition stops the loop before counter reaches 10
    if counter == 10: # part of if statement
        break
    print('Inside loop') # part of while loop
    counter = counter + 1
else: # part of while loop; runs because the loop ended without hitting break
    print('Inside else')
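
To see when the `else` block of a while loop is skipped, it helps to have a loop that can actually reach `break`. The sketch below uses a hypothetical helper, `first_above`, and made-up values; it is meant only to illustrate the two possible paths.

```python
def first_above(values, threshold):
    """Hypothetical helper: position of the first value above threshold, or None."""
    i = 0
    while i < len(values):
        if values[i] > threshold:
            print('Found', values[i], 'at position', i)
            break                      # leaving through break skips the else block
        i = i + 1
    else:
        # Runs only when the while condition became False without a break
        print('No value above', threshold)
        return None
    return i

first_above([3, 8, 15, 4], 10)   # reaches break, so the else block is skipped
first_above([3, 8, 15, 4], 20)   # never reaches break, so the else block runs
```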

## 1.2 NumPy (Numerical Python) Library <br>

So far, we have learned about lists and tuples and how useful they are for working with multiple numeric and non-numeric observations in Python. However, what happens when you want to work with large objects or with two-dimensional objects (i.e., objects that contain rows and columns)? Native Python lists and tuples, while good, are not as efficient when you begin working with large amounts of data that might be stacked on top of each other. That is where NumPy comes in. The NumPy library offers an object called an 'array' that can be stacked and can be used for many kinds of computation in a much faster and more efficient manner. The main reason is that arrays are implemented in C and store their elements in contiguous memory locations, which makes them quicker to access and manipulate. NumPy also offers several statistical functions that let us run basic analyses without additional packages. Learning how NumPy operates is fundamental to working with DataFrames (discussed in the next section). <br>

<br>

### 1.2.1 Creating Arrays <br>

The syntax for creating arrays is `np.array()`, and it can be used either to convert an existing list or tuple into an array or to create one from scratch. It is important to note that, unlike lists and tuples, an array holds a single data type, so its elements are, for example, all strings or all integers.

In [None]:
## Converting an existing list to a one dimensional numeric array

# Creating a list
list1 = [1,2,3,4,5,6,7,8,9,10]

# Creating an object that converts a list into an array
array1 = np.array(list1)
# Confirming that it is indeed an array with the type() function; alternatively, look at the variables window
type(array1)

In [None]:
## Creating a one dimensional numeric array from scratch

array1 = np.array([1,2,3,4,5,6,7,8,9,10])
print(array1)
print(type(array1))
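
One practical payoff of arrays over lists is element-wise arithmetic. The short sketch below compares the two; the variable names are illustrative.

```python
import numpy as np

values = list(range(1, 6))            # plain Python list: [1, 2, 3, 4, 5]

# With a list, element-wise arithmetic needs a loop or comprehension
doubled_list = [v * 2 for v in values]

# With an array, the operation applies to every element at once
values_array = np.array(values)
doubled_array = values_array * 2

print(doubled_list)
print(doubled_array)

# Note: multiplying a list by 2 repeats it instead of doubling each element
print(values * 2)
```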

In [None]:
## Creating a one dimensional string array
array1 = np.array(["Hi","Hola","Salut","Ciao","Privet","Hallo","Oi","Anyoung","Ahlan","Hej","Hoi"])

print(array1)
print(type(array1))

In [None]:
## Creating a two-dimensional numeric array

# Each inner pair of brackets is one row, which makes the array two-dimensional
array1 = np.array([[1,2,3,4,5,6,7,8,9,10],
                   [11,12,13,14,15,16,17,18,19,20],
                   [21,22,23,24,25,26,27,28,29,30]])
print(array1)

While arrays cannot handle multiple data types, you can coerce Python into allowing you to include different types by telling Python to store all observations as an object. You would not be able to compute calculations with this array, but it is still cool to know you can do this.

In [None]:
array1 =np.array([[True, False, 'hello'],
                  ['apple', 33.7, (0,1)],
                  [37,40,50]], dtype=object)
print(array1)
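
Without `dtype=object`, NumPy does not reject mixed inputs; it converts every element to one common type. The sketch below, using made-up values, shows what that conversion looks like by printing each array's `dtype`.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])      # integers are converted to floats
mixed_text = np.array([1, 'two', 3.0])     # every element becomes a string

print(ints.dtype)
print(mixed_numbers.dtype)
print(mixed_text.dtype)
print(mixed_text)
```

Checking `.dtype` after creating an array is a quick way to catch an unintended conversion before running calculations.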

### 1.2.2 Indexing Arrays <br>

Like lists and tuples, you are able to index specific observations from both one-dimensional and two-dimensional arrays. The indexing mechanism is the same as indexing in R, where the first coordinate refers to the y-axis (rows) and the second coordinate refers to the x-axis (columns). As a reminder, unlike R, Python begins its indexing with 0.

In [None]:
array1 = np.array([[1,2,3,4,5,6,7,8,9,10],
                   [11,12,13,14,15,16,17,18,19,20],
                   [21,22,23,24,25,26,27,28,29,30]])
# Indexing the number that is on the first row, third column
print(array1[0,2])

# Indexing the number that is on the third row, sixth column
print(array1[2,5])

# Indexing multiple values - first three values in first row | the stop index (3) is not included
print(array1[0,0:3])

# Indexing multiple values - first value from each row | the stop index (3) is not included
print(array1[0:3,0])

# Indexing all values from the array
print(array1[:,:])

### 1.2.3 NumPy Statistics <br>

The NumPy library includes functions that allow us to conduct basic statistics. There are three ways to calculate statistics when working with two-dimensional arrays: <br>
* Calculating statistics for the entire array;
* Calculating statistics for each column (must include 'axis=0' inside the function);
* Calculating statistics for each row (must include 'axis=1' inside the function). <br>

Basic statistical functions include but are not limited to: <br>
* `np.mean()` - calculates mean of an array object
* `np.sum()` - calculates sum of an array object
* `np.min()` - finds the min value of an array object
* `np.max()` - finds the max value of an array object
* `np.std()` - calculates the standard deviation of an array object
* `np.median()` - finds the median value of an array object
* `np.sort()` - sorts an array object in ascending order
* `np.sort()[::-1]` - sorts a one-dimensional array object in descending order
* `np.random.random(size=int)` - creates an array with random floats between 0 and 1
* `np.random.randint(int, int, size=int)` - creates an array of random integers from the first value up to, but not including, the second; `size` sets the shape
* `np.random.randn(int)` - creates and returns a sample or samples from the standard normal distribution
* `np.random.shuffle()` - modifies the sequence of an array by shuffling it
* `np.count_nonzero()` - returns the count of non-zero elements in an array; useful when measuring sparsity

<br>

The following lines of code will show some examples of these functions and the variations that can be done using the 'axis' arguments.

In [None]:
# Creating array the way we've been doing so
array1 = np.array([[1,2,3,4,5,6],
                   [7,8,9,10,11,12],
                   [13,14,15,16,17,18]])

# Creating the same array using the NumPy arange and reshape functions
# You can chain multiple functions by simply adding a period '.' so long as it makes sense to do so
# The arguments are as follows: (number to start on, number to stop at (not included), and step size)
# very similar to the `seq()` function in R
array1 = np.arange(1,19,1).reshape(3,6)

print(array1)

In [None]:
## Finding the mean of the array above

# Entire array
print('The entire mean of the array is:', np.mean(array1))

# Mean of each column in the array - we should get an array of six values
print('The mean of each column in the array is:', np.mean(array1, axis=0))

# Mean of each row in the array - we should get an array of three values
print('The mean of each row in the array is:', np.mean(array1, axis=1))

In [None]:
## Finding the sum of the array

# Entire array
print('The sum of the array is:', np.sum(array1))

# Sum of each column in the array - we should get an array of six values
print('The sum of each column in the array is:', np.sum(array1, axis=0))

# Sum of each row in the array - we should get an array of three values
print('The sum of each row in the array is:', np.sum(array1, axis=1))

In [None]:
## Finding the median of the array

# Entire array
print('The median of the array is:', np.median(array1))

# Median of each column in the array - we should get an array of six values
print('The median of each column in the array is:', np.median(array1, axis=0))

# Median of each row in the array - we should get an array of three values
print('The median of each row in the array is:', np.median(array1, axis=1))

In [None]:
print(np.sort(array1))
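
A few of the functions listed earlier are not demonstrated in the cells above. The sketch below covers standard deviation, descending sorts, and counting non-zero values; the small arrays `scores` and `sparse` are made up for illustration.

```python
import numpy as np

array1 = np.arange(1, 19, 1).reshape(3, 6)

# Standard deviation of the whole array and of each column
print(np.std(array1))
print(np.std(array1, axis=0))

# Descending sort of a one-dimensional array: sort ascending, then reverse
scores = np.array([3, 9, 1, 7])
scores_desc = np.sort(scores)[::-1]
print(scores_desc)

# Descending sort within each row of a two-dimensional array
rows_desc = np.sort(array1, axis=1)[:, ::-1]
print(rows_desc)

# Counting non-zero entries, e.g. to gauge sparsity
sparse = np.array([[0, 2, 0], [4, 0, 0]])
print(np.count_nonzero(sparse))
print(np.count_nonzero(sparse, axis=1))
```

Note that for a two-dimensional array, `[::-1]` alone would reverse the order of the rows rather than the values within them, which is why the last sort uses `[:, ::-1]`.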

In [None]:
# Creating random sample of floats from 0 to 1
array1 = np.random.random(size=1000)
print(array1)

In [None]:
array1 = np.random.randint(0,50, size=50)
print(array1)

Like in R, you can set a seed to ensure you get the same random values each time you run the code.

In [None]:
np.random.seed(444)
array1 = np.random.randint(0,50, size=50)
print(array1)
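
To confirm that a seed makes results repeatable, we can draw twice with the same seed and compare. This is a minimal sketch with illustrative variable names; the last two lines show NumPy's newer `default_rng` interface as an optional alternative to the global seed.

```python
import numpy as np

np.random.seed(444)
first_draw = np.random.randint(0, 50, size=5)

np.random.seed(444)                     # resetting the seed restarts the sequence
second_draw = np.random.randint(0, 50, size=5)

print(np.array_equal(first_draw, second_draw))

# Optional alternative: a Generator object that carries its own seed
rng = np.random.default_rng(444)
print(rng.integers(0, 50, size=5))
```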

## 1.3 Pandas (Panel Data) Library <br>

As mentioned above, knowing how NumPy works is essential to programming in Python, as several libraries are built on top of NumPy. Pandas is one of those libraries. The Pandas library offers data structures and operations for manipulating numeric tables and time series. It allows for the importing, creating, managing, and exporting of DataFrames, making it the staple library for data science in Python. Pandas lets us create two kinds of objects: a 'Pandas Series' and a 'Pandas DataFrame'. For this lesson, our main focus will be DataFrames and how to create, import, and manipulate them. <br>

<br>

### 1.3.1 Creating a Pandas Series <br>

A Pandas series is a simple one-dimensional array that can hold any datatype (e.g., integer, string, float, objects). You can think of a Pandas series as a single column of data in an Excel sheet. Creating a series is as simple as creating a list, tuple, or one-dimensional array.



In [None]:
PdSeries = pd.Series([1, 2, 3, 4, 5,6,7,8,9,10])

print(PdSeries)

Like NumPy arrays, a Pandas series can be indexed using brackets.

In [None]:
# Indexing the third value from the Pandas series
PdSeries[2]
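
Plain brackets look values up by index label, which happens to match the position when the index is the default 0, 1, 2, and so on. When a series has custom labels, `.loc` (by label) and `.iloc` (by position) make the intent explicit. The sketch below uses made-up numbers and an illustrative series named `turnout`.

```python
import pandas as pd

# Illustrative Series with custom (string) index labels; values are made up
turnout = pd.Series([61.2, 55.8, 70.1], index=['CA', 'TX', 'MN'])

print(turnout.loc['TX'])    # select by label
print(turnout.iloc[1])      # select by position (second value)
print(turnout.iloc[0:2])    # positional slice, stop index excluded
```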

### 1.3.2 Creating a Pandas DataFrame <br>

While one Pandas series may not be any more useful than a NumPy array, several series can be combined into a Pandas DataFrame. A Pandas DataFrame is a two-dimensional tabular data structure with labeled rows and columns, comparable to a data frame in R or a data table in Excel, Stata, SQL, or SPSS. Creating a Pandas DataFrame is similar to creating a Python dictionary or a data frame in R.

In [None]:
# Creating a DataFrame

df = pd.DataFrame({'Name': ["Student A", "Student B", "Student C"],
                   'Year': ["Third Year", "Second Year", "Second Year"],
                   'Position': ["Treasurer", "Senator", "President"]})

# Using the print function gives you an in-text DataFrame
print(df)

Calling the DataFrame without using the print function gives you an interactive table in DataSpell. This view is specific to that program; it lets us open the DataFrame in a new window (like in R) and even export it to a csv file without writing additional code.

In [None]:
df

You can also create a DataFrame from an existing two-dimensional array.

In [None]:
# Creating an array that has 10 rows and 5 columns
array1 = np.arange(1,100,2).reshape(10,5)

# Creating DataFrame and using 'columns' argument to assign names to the columns in DF
df = pd.DataFrame(array1, columns=['var1','var2','var3','var4','var5'])
df

We can also import a DataFrame from a URL that points to a csv file. For this example, we will import a csv file from the New York Times containing COVID-19 case counts. [Here](https://www.nytimes.com/interactive/2021/us/covid-cases.html) is the article showing COVID trends online, and [here](https://github.com/nytimes/covid-19-data) is the GitHub repository where this dataset was found. Pandas supports reading many types of files. Here are some examples:

* `pd.read_csv()` - reads csv files
* `pd.read_excel()` - reads excel files
* `pd.read_stata()` - reads stata files
* `pd.read_sql()` - reads the result of a SQL query or a database table

In [None]:
# Reading dataframe from a link online
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv')
df

### 1.3.3 Indexing DataFrames <br>

Indexing data from a DataFrame can be done in two ways. The first is to call the name of one or more variables (columns) inside brackets; the second is to slice rows by position.

In [None]:
# Indexing DataFrame to only give us data on the states
df['state']

In [None]:
# Indexing DataFrame to only give us data on the states and cases
# When indexing multiple variables, we need to include a second set of brackets
df[['state','cases']]

In [None]:
# Indexing DataFrame by slicing/telling Python to get specific rows
df[10:20]
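
Rows and columns can also be selected together with `.iloc` (by position) and `.loc` (by label). The sketch below uses a small made-up DataFrame named `toy` so that it runs without downloading anything.

```python
import pandas as pd

# Small made-up DataFrame so the example runs without a download
toy = pd.DataFrame({'state': ['A', 'B', 'C', 'D'],
                    'cases': [10, 20, 30, 40],
                    'deaths': [1, 2, 3, 4]})

# .iloc selects by integer position: second and third rows, first two columns
by_position = toy.iloc[1:3, 0:2]
print(by_position)

# .loc selects by label: rows labeled 1 through 3 (stop label included), named columns
by_label = toy.loc[1:3, ['state', 'deaths']]
print(by_label)
```

The difference in how the stop value is treated is easy to miss: `.iloc[1:3]` returns two rows, while `.loc[1:3]` returns three.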

### 1.3.4 Subsetting DataFrames <br>

Sometimes, we might only be interested in a particular subset of a DataFrame. Like in R, Pandas allows us to subset data from a DataFrame.

In [None]:
# Subsetting DataFrame to only give us the rows where the state is California
df[df['state']=='California']

We can also subset a DataFrame on negative conditions using the tilde operator `~`. Here, we are telling Python to subset the DataFrame so that it does not include California.

In [None]:
# Subsetting DataFrame to only give us all the states except California
df[~(df['state']=='California')]

If you want to make a new DataFrame out of this subset, you simply need to store it in a new object. Additionally, if you want the index of the new DataFrame to start from 0, you can use the `.reset_index()` method. By default, it keeps the old index as a new column named `index`.

In [None]:
df2 = df[~(df['state']=='California')].reset_index()
df2

#### Subsetting using the 'And' Operator

What if you want to subset based on multiple conditions? You can do so with the '&' operator, placing each condition in its own set of parentheses. Here, we are interested in only seeing the days on which California had over 500 cases.

In [None]:
df = df[(df['state']=='California') & (df['cases']>500)].reset_index()
df

Likewise, we can use the tilde operator to subset a DataFrame on several negative conditions.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv')

# Subsetting DataFrame to keep rows that are not from CA and that have 500 cases or fewer
df = df[~(df['state']=='California') & ~(df['cases']>500)].reset_index()
df

#### Subsetting using the 'Or' Operator <br>

Like '&', the '|' operator combines multiple conditions, but it keeps every row that meets at least one of them. This is useful when several values of one variable are of interest. In this example, we subset the DataFrame so that it keeps California and New York.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv')

df[(df['state']=="California") | (df['state']=="New York")]

We can use the '|' operator to include more than two conditions from a variable. Here, we will keep four states.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv')

df[(df['state']=="California") | (df['state']=="New York") | (df['state']=="Washington") | (df['state']=="Texas")]
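
When the list of values grows, chaining '|' conditions becomes hard to read. One common alternative is the `.isin()` method, sketched below on a small made-up DataFrame named `toy`; the list name `keep` is also illustrative.

```python
import pandas as pd

# Made-up values so the example runs without a download
toy = pd.DataFrame({'state': ['California', 'Texas', 'Ohio', 'New York', 'Texas'],
                    'cases': [5, 3, 8, 2, 7]})

keep = ['California', 'New York', 'Washington', 'Texas']

# One condition instead of four chained with '|'
subset_isin = toy[toy['state'].isin(keep)]

# The chained version, for comparison
subset_or = toy[(toy['state'] == 'California') | (toy['state'] == 'New York') |
                (toy['state'] == 'Washington') | (toy['state'] == 'Texas')]

print(subset_isin)
print(subset_isin.equals(subset_or))
```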

### 1.3.5 Transforming DataFrames <br>

Pandas allows us to make changes to DataFrames similar to how we can manipulate DataFrames in R. <br>

<br>

#### Creating/Replacing Variables/Columns in a DataFrame <br>

The following code snippets show different ways of adding a new variable to the existing DataFrame.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv')

# Creating a new variable
df['death/case ratio'] = 0

# Like in R, you can use the head() function to give you the first 5 observations
df.head()

In [None]:
# Creating a new variable/replacing a current one with new values; in this case, we will create a variable that calculates the ratio of deaths to cases
# Python will calculate what you want it to and will do it per row
df['death/case ratio'] = (df['deaths']/df['cases'])

# Like in R, you can use the tail() function to get the last 5 observations
df.tail()

In [None]:
# Replacing observations from DataFrame across all columns from the 0th to 2nd observation
# Copying a df using copy(); this ensures that changes to the copy do not affect the original DataFrame
df2 = df.copy()

# Telling Python to replace all columns from the 0th row to the 2nd with 1000
df2.iloc[0:3,:]=1000

df2.head()

In [None]:
# Telling Python to replace the first and second columns from the 0th row to the 2nd with the string 'DataFrame'
df2.iloc[0:3,0:2]='DataFrame'

df2.head()

In [None]:
# Telling Python to replace all observations in the fips column
df2.loc[:,'fips']='Python'

df2.head()

#### Dropping/Deleting Variables/Columns from a DataFrame <br>

The following code snippets show different ways of dropping variables from an existing DataFrame.

In [None]:
## Dropping last variable, death/case ratio
## Axis 1 is columns
## Inplace True means that these changes will be reflected in the dataframe; False means it will only be reflected in the code snippet output
df2.drop('death/case ratio', axis=1, inplace= True)
df2.head()

In [None]:
## Dropping the first three rows
## Axis is 0 for rows
df2.drop([0,1,2], axis=0, inplace= True)
df2 = df2.reset_index()
df2.head()

In [None]:
## Dropping the first 50 rows
## Axis is 0 for rows
df2.drop(range(0,50), axis=0, inplace= True)
df2 = df2.reset_index()
df2.head()

When we reset the index, the old index moves into the DataFrame as a new column (here `index`, and `level_0` after the second reset), so we must get rid of it.

In [None]:
df2.drop(['level_0','index'], axis=1, inplace= True)
df2.head()

#### Changing Variable Data Types in DataFrame <br>

Sometimes, you might need to change the type of a variable due to poor formatting or other reasons. Pandas allows us to change data types using the `pd.to_numeric()` function or the `.astype()` method. The following code snippets show different variations of changing data types.

To get a better sense of what data types we have within our DataFrame, we can use the `.dtypes` command to have Python give us this info for each variable. For this example, we will create a new variable that is a string and convert it to an integer.

In [None]:
df['var7'] = '0'

df.dtypes

We will change the 'var7' variable from an 'object' type to an 'integer' type.

In [None]:
df['var7'] = df['var7'].astype(int)

df.dtypes

As you can see, the 'var7' we created is now an integer. We can also turn it back into a string.

In [None]:
df['var7'] = df['var7'].astype(str)

df.dtypes

In the DataFrames used here, strings are categorized as `object`, so a column listed as `object` usually holds strings. <br>
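
The cells above use `.astype()`, which raises an error if any value cannot be converted. For poorly formatted columns, `pd.to_numeric()` with `errors='coerce'` is one possible alternative, sketched below on a made-up series named `raw`.

```python
import pandas as pd

# Made-up column containing a badly formatted entry
raw = pd.Series(['12', '7', 'n/a', '30'])

# errors='coerce' turns anything that cannot be parsed into NaN
clean = pd.to_numeric(raw, errors='coerce')

print(clean)
print(clean.dtype)
```

Because NaN is stored as a float, the resulting column is a float type rather than an integer type.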

<br>

#### Sorting and Grouping Data <br>

Sometimes, you might want to organize your DataFrame by a particular variable or perform operations across groups. We can do this using the `.sort_values()` and `.groupby()` functions. The `.sort_values()` function orders your DataFrame by the columns of your choice, and the `.groupby()` function performs statistical operations by categorical group. It can also tabulate data (similar to the `table()` function in R).

In [None]:
# Sort DataFrame by state in ascending order
df.sort_values(by=['state'], inplace=True)

df.head()

In [None]:
# Sort DataFrame by state in descending order
df.sort_values(by=['state'], inplace=True, ascending=False)

df.head()

The `.groupby()` function can be used in combination with other statistical operations to answer particular questions that we might have.

In [None]:
# Grouping DataFrame by state and counting the fips values to get the total number of observations per state
df.groupby("state")['fips'].count()

In [None]:
# Grouping DataFrame by state and getting the sum of deaths
df.groupby("state")['deaths'].sum()

In [None]:
# Grouping DataFrame by state and getting the maximum number of deaths per state
df.groupby("state")['deaths'].max()
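
For simple tabulations similar to `table()` in R, `.value_counts()` and `.groupby().size()` are two options. The sketch below uses a small made-up DataFrame named `toy`.

```python
import pandas as pd

# Made-up values so the example runs without a download
toy = pd.DataFrame({'state': ['CA', 'TX', 'CA', 'NY', 'CA', 'TX'],
                    'deaths': [1, 2, 3, 4, 5, 6]})

# Number of rows per state, sorted from most to least frequent
counts = toy['state'].value_counts()
print(counts)

# The same tabulation through groupby; .size() counts the rows in each group
sizes = toy.groupby('state').size()
print(sizes)
```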

Let's group by state and year now. When working with time series data, we should convert our date variables to a format that Python can interpret, 'datetime'. Much like converting other data types, the function `pd.to_datetime()` converts an existing date variable into a datetime object. From there, we can extract specific parts of the date we are interested in. For this example, we will create a year variable out of the date variable we already have.

In [None]:
# Getting Data Types of our current DataFrame
df.dtypes

Our current date variable is categorized as an object, which does not let us extract parts of the date. Let's convert it to datetime.

In [None]:
# Converting current date to datetime type
df['date'] = pd.to_datetime(df['date'])

df.dtypes

Now that date is a datetime object, let's extract a year variable in order to group our DataFrame by state and year.

In [None]:
# Creating year variable out of the date variable
df['year'] = df['date'].dt.year

# Grouping by state and year to get sum of deaths per year
df.groupby(['state','year'])['deaths'].sum()

If we want to perform multiple statistical operations, we can use the `.agg()` command to tell Python to place them all in one output.

In [None]:
# Grouping by state and year to get the sum, maximum, and mean of deaths for each state-year
df.groupby(['state','year'])['deaths'].agg(['sum','max','mean'])

#### Concatenating (Rows) and Merging (Columns) Data <br>

A final task we will cover in relation to data handling is combining datasets. Pandas gives us the `concat()` and `merge()` functions to combine multiple data sources into a single dataset. Let's use the `concat()` function first. <br>

Before concatenating or merging anything, we will be creating two new DataFrames out of our covid dataset. We will subset CA and Texas into their own DataFrames, then bring them back together.

In [None]:
# Creating new df in order to subset it to include CA and TX separately
df_ca = df.copy()
df_tx = df.copy()

# Subsetting df to get CA and TX
df_ca = df_ca[df_ca['state']=='California']
df_tx = df_tx[df_tx['state']=='Texas']

# Subsetting CA to split variables
df3 = df_ca[['date','state','cases']].reset_index(drop=True)
df4 = df_ca[['date','deaths','year']].reset_index(drop=True)

# Sorting new df by date
df_ca.sort_values(by=['date'], inplace=True)
df_tx.sort_values(by=['date'], inplace=True)
df3.sort_values(by=['date'], inplace=True)
df4.sort_values(by=['date'], inplace=True)

# Resetting index | drop = true ensures we don't have an index variable in the df
df_ca = df_ca.reset_index(drop=True)
df_tx = df_tx.reset_index(drop=True)
df3 = df3.reset_index(drop=True)
df4 = df4.reset_index(drop=True)

# Dropping unwanted variables
df_ca.drop(['death/case ratio','fips','var7'], axis=1, inplace=True)
df_tx.drop(['death/case ratio','fips','var7'], axis=1, inplace=True)

Because we are interested in joining two datasets by rows, we will be using the `concat()` function. Setting `ignore_index=True` tells pandas to discard the original row labels and number the rows of the combined DataFrame from 0.

In [None]:
df_both_states = pd.concat([df_ca, df_tx], ignore_index=True)
df_both_states.head()

If we wanted to join two datasets by columns, unlike `concat()`, which here joins by rows, we would use the `merge()` function, which combines datasets horizontally. `merge()` matches the rows of two datasets on one or more key variables. In addition, it comes with a few options for its `how=` argument: <br>
* `how='inner'` - keeps only the keys found in both datasets
* `how='outer'` - keeps all observations from both datasets (expect a few NaNs as a result)
* `how='left'` - keeps every key from the calling (left) dataset
* `how='right'` - keeps every key from the second (right) dataset

We will be merging with an inner option to ensure that both datasets are matched.

In [None]:
df5 = df3.merge(df4, on="date", how='inner')
df5.head()
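
Because `df3` and `df4` contain exactly the same dates, every `how=` option gives the same result for them. The sketch below uses two small made-up tables, `cases` and `deaths`, that share only some dates, so the differences between the options become visible. The `indicator=True` argument adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Two made-up tables that share only some dates
cases = pd.DataFrame({'date': ['2020-03-01', '2020-03-02', '2020-03-03'],
                      'cases': [10, 15, 21]})
deaths = pd.DataFrame({'date': ['2020-03-02', '2020-03-03', '2020-03-04'],
                       'deaths': [0, 1, 1]})

inner = cases.merge(deaths, on='date', how='inner')
outer = cases.merge(deaths, on='date', how='outer', indicator=True)
left = cases.merge(deaths, on='date', how='left')

print(inner)   # only the dates present in both tables
print(outer)   # every date from either table, with NaN where a value is missing
print(left)    # every date from 'cases', matched deaths where available
```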

## 1.4 Using Matplotlib <br>

The Matplotlib library allows us to create basic plots out of arrays in Python. We can use it to create line, scatter, and bar plots, to name a few.

<br>

We will start by importing the `matplotlib.pyplot` module below.

In [None]:
import matplotlib.pyplot as plt

### 1.4.1 Creating Plots Out of Lists (Vectors) <br>

In order to demonstrate its capabilities, let's create a simple set of variables that will be used to plot our X and Y axis.

In [None]:
Y = [100,200,300,400,500,600,700,800]
X = [2016,2017,2018,2019,2020,2021,2022,2023]

Now that we created our variables, let's begin by plotting our data using a line plot, followed by a scatter plot.

`plt.plot()` is the command for a line graph.

In [None]:
# plotting the data
plt.plot(X, Y)

# Adding a  title to our plot
plt.title("Line Plot")

# Adding labels to our plot
plt.ylabel("y-axis")
plt.xlabel("x-axis")
plt.show()

`plt.scatter()` is the command for a scatter plot.

In [None]:
# plotting the data
plt.scatter(X, Y)

# Adding a  title to our plot
plt.title("Scatter Plot")

# Adding labels to our plot
plt.ylabel("y-axis")
plt.xlabel("x-axis")
plt.show()

`plt.bar()` is the command for a bar plot.


In [None]:
# plotting the data
plt.bar(X, Y)

# Adding a  title to our plot
plt.title("Bar Plot")

# Adding labels to our plot
plt.ylabel("y-axis")
plt.xlabel("x-axis")
plt.show()

### 1.4.2 Adding Additional Plot Arguments <br>

Like in ggplot2, you can make additional adjustments to plots, such as changing the color of the lines, adding markers along them, and changing the line style. The links below describe the kinds of aesthetic arguments you can give to the `plot()` function. <br>

* [Colors available](https://i.stack.imgur.com/lFZum.png)
* [Markers available](https://matplotlib.org/stable/api/markers_api.html)
* [Linestyles available](https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html) <br>

The following code shows an example of how to integrate the arguments into the `plt.plot()` function.

In [None]:
# plotting the data once more
plt.plot(X, Y, color='coral', marker='o', linestyle='dashed')

# Adding a  title to our plot just as before
plt.title("Line Plot")

# Adding labels to our plot
plt.ylabel("y-axis")
plt.xlabel("x-axis")
plt.show()

### 1.4.3 Plotting Variables From A DataFrame <br>

The examples above show how to plot simple vectors, but what if we want to plot data from a DataFrame? We can use `.plot()` to plot specific variables from a DataFrame as well. <br>

<br>

Let's start by using the COVID dataset we imported earlier.

In [None]:
# Reading dataframe from a link online
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv')

# Grouping DataFrame by state to get the total number of COVID cases per state
# 'cases' is a running (cumulative) total, so the largest value for each state approximates its total
df = df.groupby("state")['cases'].max()

# Keeping only the first 10 states in the data frame
df = df[0:10]

For this example, we will be plotting the cumulative number of COVID cases per state. Unlike the code above, we will call `.plot()` on the pandas object itself, which lets us specify the type of plot through an argument.

In [None]:
# plotting the COVID data
# df is now a Series indexed by state, so the x and y values come from its index and values
# 'kind' tells Python to plot a bar graph
# 'width' tells Python to alter the width of the bars; the pandas default is 0.5
df.plot(kind='bar', color='y', width=0.4)

# adding title
plt.title("Frequency of COVID Cases Per State Plot")

# adding axis-labels
plt.ylabel("Cases")
plt.xlabel("State")

## 1.5 Using the Seaborn Library <br>

While Matplotlib is a powerful library, its modularity can make for a steep learning curve. Luckily, Python offers another package that makes plotting easier. Seaborn, like Matplotlib, is used for plotting graphs, and it builds on the Matplotlib, Pandas, and NumPy libraries to do so. Its simpler syntax lets users quickly pick up plotting and create attractive graphs that display relationships in data. The remainder of the workshop lessons will mainly rely on Seaborn to produce graphs.

<br>

We will start by importing the `seaborn` package below.

In [None]:
# Unlike the packages above, our program does not come with seaborn, so we must use the `conda install` command to download the library before we can use it. For easier installation, remove the hash below, click on the icon that appears above the "conda install seaborn" line of code, and click on install. After that, importing the seaborn package should work.
# conda install seaborn
import seaborn as sns

To show its power, let's use a simple example where we observe the average number of yearly COVID cases in California.

In [None]:
# Reading dataframe from a link online
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv')

# Subsetting df to only include CA; .copy() makes the subset its own DataFrame so new columns can be added without a warning
df = df[(df['state']=="California")].copy()

# Converting current date to datetime type
df['date'] = pd.to_datetime(df['date'])

# Creating year variable out of date variable
df['year'] = df['date'].dt.year

# Organizing data even further to turn cumulative cases into new cases per day
df['cases_(-1)']=df['cases'].shift(1)
df['non_cum_cases']= df['cases']-df['cases_(-1)']
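
The shift-and-subtract step above can also be written with the built-in `.diff()` method. The sketch below checks that the two approaches agree on a small made-up set of cumulative counts; the DataFrame `toy` and its column names are illustrative.

```python
import pandas as pd

# Made-up cumulative counts for five consecutive days
toy = pd.DataFrame({'cases': [2, 5, 5, 9, 14]})

# Approach used above: subtract the previous day's value
toy['manual'] = toy['cases'] - toy['cases'].shift(1)

# Built-in shortcut that does the same subtraction
toy['diff'] = toy['cases'].diff()

print(toy)
print(toy['manual'].equals(toy['diff']))
```

In both versions the first row is NaN, since there is no previous day to subtract.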

The following is the simplest way to create a plot using seaborn. Like Matplotlib, Seaborn allows us to produce line plots, bar plots, box plots, scatter plots, kernel density plots, regression plots, etc.

In [None]:
sns.lineplot(data=df, x='year', y='non_cum_cases')

As with Matplotlib, we can give seaborn additional arguments to customize our plots. The `sns.set_style()` function applies a preset theme to the plots that follow. Some of the presets that seaborn has available are: <br>
* 'whitegrid'
* 'darkgrid'
* 'white'
* 'dark'
* 'ticks' <br>

In [None]:
# Setting theme style
sns.set_style('ticks')

# errorbar=None removes the confidence intervals (seaborn 0.12 or newer; older versions use ci=None)
# linestyle changes the style of the line
# color changes the color of the line
plot = sns.lineplot(data=df, x='year', y='non_cum_cases', color='y', linestyle='solid', errorbar=None)

# Adding title and labels
plot.set_title('Average COVID Cases per Year (CA)', fontdict={'size': 18, 'weight': 'normal'})
plot.set_xlabel('Year', fontdict={'size': 12})
plot.set_ylabel('COVID Cases (Avg)', fontdict={'size': 12})

# Saving figure to your directory
fig = plot.get_figure()
fig.savefig('output.png')