# Dealing with Data Spring 2022 – Class 7

---

## Introduction to Pandas and Matplotlib

Pandas is a Python library for loading, cleaning, and analyzing tabular data.

In [None]:
import pandas as pd # pandas = 'python data analysis library' (https://pandas.pydata.org/)
import matplotlib  # 2D plotting library (https://matplotlib.org/)
import matplotlib.pyplot as plt # a plotting module within matplotlib
import seaborn as sns # statistical data visualization (https://seaborn.pydata.org/)

Let's get started with Data Frames, a table structure of rows and columns used in Pandas.

We begin by creating a new data frame using pd.DataFrame, passing in a list of dictionaries. Each dictionary becomes one row, and its keys become the column names.
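As a minimal sketch of that step, the cell below builds a small data frame from a list of dictionaries. The records (names, ages, cities) are made up for illustration and are not part of the class data.

```python
import pandas as pd

# Hypothetical example records: each dictionary becomes one row,
# and the dictionary keys become the column names.
records = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "Boston"},
    {"name": "Carol", "age": 35},  # no "city" key, so that cell is filled with NaN
]

df = pd.DataFrame(records)
print(df)
```

Note that the dictionaries don't need identical keys: Pandas takes the union of all keys as the columns and marks whatever is missing.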


Let's dig a bit deeper to understand what just happened.

So that's all a data frame is: a table of rows and columns!

---

# ⭕ **QUESTIONS?**

---

Let's delve further into Pandas with a different data set. 

In [None]:
!rm -f restaurant.csv* # 'rm' = 'remove'
                            # '-f' means 'force': don't prompt, and don't complain if the file doesn't exist
                            # 'restaurant.csv*' matches any file in the current directory whose name starts with 'restaurant.csv'
                            # in total, this command removes any prior copy of the file, if it exists

!curl 'https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD' -o restaurant.csv
                            # 'curl' is a tool to transfer data from or to a server
                            # for more on 'curl' visit (https://curl.haxx.se/docs/manpage.html)

# !gzip restaurant.csv # compress the file

In [None]:
# changing the notebook settings to display more rows and columns by default

pd.options.display.max_rows = 1000 
pd.options.display.max_columns = 1000

Now that we have our data we can read in our csv using pd.read_csv.

In [None]:
restaurants = pd.read_csv('./restaurant.csv',
                         encoding = 'utf-8', # for more on UTF-8 check (https://www.w3schools.com/charsets/ref_html_utf8.asp)
                         dtype = str, # read every column as a string; we convert individual columns ourselves later
                         low_memory = False) # read the whole file in one pass instead of in chunks, at the cost of more memory
# note: 'parse_dates = True' only tries to parse the index, and 'infer_datetime_format' is deprecated as of pandas 2.0,
# so both are left out here; the date columns stay strings until we convert them with 'to_datetime'

---

Now that we have successfully read our CSV, let's look at some basics.

For column definitions, let's check out [the documentation](https://data.cityofnewyork.us/api/views/43nn-pn8j).
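A first look at a freshly loaded table usually involves a handful of calls. The sketch below runs them on a toy stand-in for the restaurants table; the three rows are invented, and only the column names are meant to resemble the real file.

```python
import pandas as pd

# Toy stand-in for the restaurants table (values are made up).
toy = pd.DataFrame({
    "DBA": ["CAFE ONE", "PIZZA TWO", "DELI THREE"],
    "BORO": ["Manhattan", "Brooklyn", "Queens"],
    "SCORE": ["12", "27", "9"],  # stored as strings, like our CSV columns
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.columns)  # the column labels
print(toy.head(2))  # the first two rows
toy.info()          # per-column non-null counts and dtypes
```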

---

# ⭕ **QUESTIONS?**

---

Note that all of our data is stored as dtype 'object', aka, as strings. But 'SCORE' is a numerical value, not a string, so let's convert that column in our data frame.
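One way to do that conversion is `pd.to_numeric`. The sketch below uses a tiny made-up column rather than the real data; `errors="coerce"` is an optional choice that turns anything unparseable into NaN instead of raising an error.

```python
import pandas as pd

# Made-up scores stored as strings, with one missing value.
scores = pd.DataFrame({"SCORE": ["12", "27", None, "9"]})

# Convert the strings to numbers; missing or unparseable entries become NaN.
scores["SCORE"] = pd.to_numeric(scores["SCORE"], errors="coerce")

print(scores.dtypes)           # SCORE is now a floating-point column
print(scores["SCORE"].mean())  # numeric operations now work; NaN is skipped
```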

---

# ⭕ **QUESTIONS?**

---

Let's take a moment to explore what else we can customize in our histogram.
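Here is a sketch of a few common histogram options, using randomly generated scores as a stand-in for the real 'SCORE' column. The specific colors, bin count, and labels are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random stand-in data: 500 integer scores between 0 and 59.
rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(0, 60, size=500), name="SCORE")

ax = scores.hist(
    bins=20,              # number of buckets
    range=(0, 60),        # lowest and highest value covered by the buckets
    color="steelblue",    # fill color of the bars
    edgecolor="white",    # outline color, makes individual bars visible
    alpha=0.8,            # transparency (0 = invisible, 1 = solid)
    figsize=(8, 4),       # width and height in inches
)
ax.set_xlabel("score")
ax.set_ylabel("number of inspections")
```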

We can also use KDE (kernel density estimation) to estimate a smooth, continuous density function, instead of the bucketized counts of a histogram.
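To make the idea concrete, here is a hand-rolled sketch of a Gaussian KDE in NumPy: the estimate at each point is the average of small bell curves centered on every sample. The function name `gaussian_kde_estimate` and the bandwidth of 2.0 are illustrative assumptions; in practice you would more likely call `Series.plot.kde()`, which relies on SciPy.

```python
import numpy as np

def gaussian_kde_estimate(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centered on each sample."""
    samples = np.asarray(samples, dtype=float)
    # z[i, j] = scaled distance between grid point i and sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

# Random stand-in data, roughly shaped like a set of scores.
rng = np.random.default_rng(0)
samples = rng.normal(loc=20, scale=5, size=300)

grid = np.linspace(0, 40, 401)
density = gaussian_kde_estimate(samples, grid, bandwidth=2.0)

# A density should integrate to roughly 1 over the range that holds the data.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(area)
```

A larger bandwidth gives a smoother curve that hides detail; a smaller one follows the individual samples more closely.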

---

# ⭕ **QUESTIONS?**

---

Now let's do some work with dates.

Note that our dates are stored as strings, which doesn't let us sort, filter, or plot them as dates. We can convert them with the 'to_datetime' function, passing a format string that describes how the dates are written.

You'll note that some inspection dates come out as 1900-01-01 00:00:00. This is not a conversion error: the data set uses 01/01/1900 as a placeholder for establishments that have not yet been inspected. Let's get rid of any rows that have that value for their inspection date.
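A minimal sketch of both steps on made-up rows follows. The column name 'INSPECTION DATE' and the month/day/year format are assumptions about how the file is laid out, so check them against the actual data.

```python
import pandas as pd

# Made-up rows; the middle one carries the 1900 placeholder date.
inspections = pd.DataFrame({
    "DBA": ["CAFE ONE", "PIZZA TWO", "DELI THREE"],
    "INSPECTION DATE": ["03/15/2021", "01/01/1900", "11/02/2020"],
})

# Convert strings to datetimes, stating the format explicitly (assumed: MM/DD/YYYY).
inspections["INSPECTION DATE"] = pd.to_datetime(
    inspections["INSPECTION DATE"], format="%m/%d/%Y"
)

# Keep only the rows whose date is not the 1900-01-01 placeholder.
placeholder = pd.Timestamp("1900-01-01")
inspections = inspections[inspections["INSPECTION DATE"] != placeholder]

print(inspections)
```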

---

# ⭕ **QUESTIONS?**

---

# Exercise 1: Plot a histogram of our dates

In [None]:
# your code here

# Exercise 2: Change the number of bins in our histogram

In [None]:
# your code here

---

# ⭕ **QUESTIONS?**

---

Now that we've worked with our dates, let's look at categorical values.

Sometimes we need categorical values: a variable that takes one of a fixed set of values, possibly with an implicit order, for instance an 'ABC' grade (as we do in our restaurants data set).

In [None]:
restaurants["BORO"] =  pd.Categorical(restaurants["BORO"], ordered=False) 
restaurants["GRADE"] =  pd.Categorical(restaurants["GRADE"], categories = ['A', 'B', 'C'], ordered=True)
# 'ordered=True' means the three categories are ordered as listed, from lowest to highest: 'A' < 'B' < 'C'
restaurants["VIOLATION CODE"] =  pd.Categorical(restaurants["VIOLATION CODE"], ordered=False)
restaurants["CRITICAL FLAG"] =  pd.Categorical(restaurants["CRITICAL FLAG"], ordered=False)
restaurants["ACTION"] =  pd.Categorical(restaurants["ACTION"], ordered=False)
restaurants["CUISINE DESCRIPTION"] =  pd.Categorical(restaurants["CUISINE DESCRIPTION"], ordered=False)

restaurants.dtypes
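To see what an ordered categorical buys us, here is a small sketch with made-up grades. The order of the `categories` list defines the comparison order, lowest first, and any value not in that list becomes missing.

```python
import pandas as pd

# Made-up grades; "Z" is not one of the declared categories.
grades = pd.Series(
    pd.Categorical(["B", "A", "C", "A", "Z"], categories=["A", "B", "C"], ordered=True)
)

print(grades.isna().sum())    # "Z" was turned into a missing value
print(grades > "A")           # comparisons follow the category order
print(grades.min(), grades.max())
print(grades.sort_values())   # sorts by category order, missing values last
```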

# Let's delve into a particular column, 'CUISINE DESCRIPTION'
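A common first step with a column like this is counting how often each value appears. The sketch below uses `value_counts` on a handful of made-up cuisine labels.

```python
import pandas as pd

# Made-up cuisine labels standing in for the real column.
cuisine = pd.Series(
    ["Pizza", "American", "Pizza", "Chinese", "American", "Pizza"],
    name="CUISINE DESCRIPTION",
)

counts = cuisine.value_counts()  # sorted from most to least frequent
print(counts.head(2))            # the two most common values
print(cuisine.value_counts(normalize=True))  # shares instead of raw counts
```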

# ⭕ **QUESTIONS?**

---

# Exercise 3: What are the 10 most common violation codes? 

In [None]:
# your code here

# Exercise 4: Plot the 10 most common violation codes as a bar chart

In [None]:
# your code here

# Exercise 5: Plot the number of inspections across each borough

In [None]:
# your code here

# ⭕ **QUESTIONS?**

---

# Imagine we want to get a subset of our data frame based on the columns we're interested in.
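Selecting columns is done by passing a list of column names inside the square brackets. A sketch on a made-up table:

```python
import pandas as pd

# Toy stand-in for the restaurants table (values are made up).
toy = pd.DataFrame({
    "DBA": ["CAFE ONE", "PIZZA TWO", "DELI THREE"],
    "BORO": ["Manhattan", "Brooklyn", "Queens"],
    "SCORE": [12, 27, 9],
    "GRADE": ["A", "B", "A"],
})

# Note the double brackets: the inner pair is a Python list of column names.
subset = toy[["DBA", "GRADE"]]
print(subset)
```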

# What if instead we wanted to select the rows we're interested in? To do that, we generate a list of boolean (True or False) values, one for each row of our Data Frame, then use that list to select which rows to keep.

# In this case, '04L' is the code for 'has mice'.
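The two steps, building the boolean list and then using it to filter, can be sketched like this on made-up rows:

```python
import pandas as pd

# Made-up violations; '04L' is the code discussed above.
violations = pd.DataFrame({
    "DBA": ["CAFE ONE", "PIZZA TWO", "CAFE ONE", "DELI THREE"],
    "VIOLATION CODE": ["04L", "10F", "04L", "02G"],
})

# Step 1: one True/False value per row.
has_mice = violations["VIOLATION CODE"] == "04L"
print(has_mice)

# Step 2: keep only the rows where the value is True.
mice = violations[has_mice]
print(mice)
```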


# Exercise 6: Which restaurants have the most mice complaints? 

In [None]:
# your code here

# Exercise 7: Let's pull up all of Subway's mice complaints

In [None]:
# your code here

# ⭕ **QUESTIONS?**

---

# Now let's do some work with Pivot Tables. First, let's count the number of restaurants inspected every day.
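A minimal sketch of such a pivot table on made-up rows is below. 'CAMIS' is assumed here to be the restaurant identifier column; note that `aggfunc="count"` counts rows, so a restaurant with several rows on the same day is counted several times.

```python
import pandas as pd

# Made-up inspection rows: three on March 1st, one on March 2nd.
inspections = pd.DataFrame({
    "CAMIS": ["1", "2", "3", "1"],
    "INSPECTION DATE": pd.to_datetime(
        ["2021-03-01", "2021-03-01", "2021-03-01", "2021-03-02"]
    ),
})

per_day = pd.pivot_table(
    inspections,
    index="INSPECTION DATE",  # one row of the result per distinct date
    values="CAMIS",           # the column being aggregated
    aggfunc="count",          # how to combine the rows that share a date
)
print(per_day)
```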

# Exercise 8: Plot the total number of inspections over 1 month

In [None]:
# your code here

# ⭕ **QUESTIONS?**

---

# We can also add some basic titles to our plot.
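Pandas plotting methods return a Matplotlib Axes object, and titles and axis labels are set on it. A sketch with made-up daily counts:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up counts of inspections for three consecutive days.
per_day = pd.Series(
    [3, 5, 2],
    index=pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"]),
)

ax = per_day.plot(figsize=(8, 4))
ax.set_title("Inspections per day")
ax.set_xlabel("inspection date")
ax.set_ylabel("number of inspections")
```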

# Exercise 9: Create a pivot table where we break down the results by boro

In [None]:
# your code here

# ⭕ **QUESTIONS?**

---

# Let's now take some time to explore Matplotlib

# Note, there are lots of predefined styles available, too
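As a quick sketch, `plt.style.available` lists the style names that ship with your Matplotlib installation, and `plt.style.context` applies one temporarily. The choice of 'ggplot' here is arbitrary.

```python
import matplotlib.pyplot as plt

# Names of the predefined styles in this installation.
print(plt.style.available)

# Apply a style only inside the 'with' block; plt.style.use("ggplot")
# would instead change it for the rest of the session.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot([0, 1, 2], [0, 1, 4])
    ax.set_title("ggplot style")
```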


# Let's break down all the possibilities with Matplotlib

In [None]:
import numpy as np # needed for the random data below

fig = plt.figure(figsize=(10,6))

# Create the first subfigure
sub1 = fig.add_subplot(2,2,1)
sub1.set_xlabel('some random numbers')
sub1.set_ylabel('more random numbers')
sub1.set_title("Random scatterplot")
sub1.plot(np.random.randn(1000), np.random.randn(1000), 'r.')

# Create the second subfigure
sub2 = fig.add_subplot(2,2,2)
sub2.hist(np.random.normal(size=500), bins=15)
sub2.set_xlabel('sample value')
sub2.set_ylabel('count')
sub2.set_title("Normal distribution")

# Create the third subfigure
numpoints = 100
x = np.linspace(0, 10, num=numpoints)
sub3 = fig.add_subplot(2,2,3)
sub3.plot(x, np.sin(x) + x + np.random.randn(numpoints), "r")
sub3.plot(x, np.sin(x) + 0.5 * x + np.random.randn(numpoints), "g")
sub3.plot(x, np.sin(x) + 2 * x + np.random.randn(numpoints), "b")
sub3.set_xlabel('x from 0 to 10')
sub3.set_ylabel('function value')

# Create the fourth subfigure
sub4 = fig.add_subplot(2,2,4)
x = np.random.randn(10000)
y = np.random.randn(10000)
sub4.hist2d(x,y,bins=100);
sub4.set_xlabel('x axis title')
sub4.set_ylabel('y axis title')

plt.tight_layout()
plt.savefig("normalvars.png", dpi=150)

# A bit more on what can be done...

In [None]:
# We can split multiple series into subplots with a single argument

variables = pd.DataFrame({'normal': np.random.normal(size=100), 
                       'gamma': np.random.gamma(1, size=100), 
                       'poisson': np.random.poisson(size=100)})

variables.cumsum(0).plot(subplots=True,figsize=(10,6))

In [None]:
# Or, have some series displayed on secondary y-axis

variables.cumsum(0).plot(secondary_y='normal')

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
for i,var in enumerate(['normal','gamma','poisson']):
    variables[var].cumsum(0).plot(ax=axes[i], title=var)
axes[0].set_ylabel('cumulative sum (normal)')
axes[1].set_ylabel('cumulative sum (gamma)')
axes[2].set_ylabel('cumulative sum (poisson)')

# Let's check out a new data set

---

# ⭕ **QUESTIONS?**

---

# Histograms

# Exercise 11: How do we divide our histogram into 30 bins? 

In [None]:
# your code here

# ⭕ **QUESTIONS?**

---

# Density Plots

## Rather than purely represent the underlying data, this is an _estimate_ of the underlying true distribution.

# Boxplots

## Think of a boxplot as viewing the data 'from above'. 

# Scatterplots

# We can even go so far as to map additional variables to the size, symbol, or color of the points
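Here is a sketch of a scatterplot where two extra variables drive the size and the color of each point. All four variables are random stand-ins, and the colormap is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-in data: 50 points with two extra variables.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)
sizes = 20 + 100 * rng.random(50)  # drives the marker area
colors = rng.random(50)            # drives the marker color

fig, ax = plt.subplots(figsize=(6, 4))
points = ax.scatter(x, y, s=sizes, c=colors, cmap="viridis", alpha=0.7)
fig.colorbar(points, ax=ax, label="third variable")
ax.set_xlabel("x")
ax.set_ylabel("y")
```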

# Hexagonal Bin Plot

## This is perfect for when you have a larger number of points to display. It's also useful if your data are too dense to plot each point individually in a scatter plot.
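A sketch of a hexagonal bin plot on 10,000 random points, where each hexagon is colored by how many points fall inside it. The grid size and colormap are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-in data: too many points for a readable scatterplot.
rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = rng.normal(size=10000)

fig, ax = plt.subplots(figsize=(6, 5))
hb = ax.hexbin(x, y, gridsize=30, cmap="Blues")  # gridsize = hexagons along the x axis
fig.colorbar(hb, ax=ax, label="points per hexagon")
ax.set_xlabel("x")
ax.set_ylabel("y")
```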

---

# ⭕ **QUESTIONS?**

---