# Data Analysis with Python and Pandas 

Pandas is a Python package providing ["fast, flexible, and expressive data structures designed to make working with 'relational' or 'labeled' data both easy and intuitive."](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html)

In [None]:
import pandas as pd  # "pd" is the alias the later cells rely on

In this course, we will be using Pandas to do _a lot_. Today, we will use it for some preliminary data analysis. 

# Data Upload

The first step of data analysis is to actually get your data in the right place. Remember that using Colab is like borrowing someone else's computer. So, in order to upload our csv (called "SternTechUserData.csv") we need to: 

1. *Download the .zip file from Brightspace to your own computer,*

2. *In Colab, click on the file icon on the left-hand side of the screen, then on the upload icon just below "Files" (also on the left-hand side).*

3. *Upload the unzipped SternTechUserData.csv file.*

This dataset contains fake data that I created randomly to illustrate the basic tenets of data analysis using Python and Pandas. 

In order to read this dataset using Pandas, we assign it to the variable `df` below (a common variable name, short for 'data frame'):
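A minimal sketch of this step. To keep the example self-contained, it reads from an in-memory string instead of the uploaded file, and the column names (`age`, `region`) are hypothetical stand-ins rather than the real columns of SternTechUserData.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for the uploaded file; the real columns may differ.
csv_text = "age,region\n21,SouthEast\n34,NorthEast\n"

# In Colab, after uploading, the call would instead look like:
#   df = pd.read_csv("SternTechUserData.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (number of rows, number of columns)
```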

In [None]:
# read in the csv

---

# ⭕ **QUESTIONS?**

---

# Primary Analysis of our Data

In [None]:
pd.options.display.max_rows = 2000 
    
pd.options.display.max_columns = 50

### `df.head()` 
will give us the first five rows of our data frame 

### `df.tail()` 
will give us the last five rows 

### `df.head(15)` 
will give us the first fifteen rows <br>
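As a quick sketch on a toy data frame (the column name `value` is made up for illustration), these three calls behave like this:

```python
import pandas as pd

# Toy frame with 20 rows, standing in for the course dataset.
toy = pd.DataFrame({"value": range(20)})

first_five = toy.head()        # default is n=5
last_five = toy.tail()         # default is n=5
first_fifteen = toy.head(15)   # pass n to change the number of rows
print(len(first_five), len(last_five), len(first_fifteen))
```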

In [None]:
# df.head()

In [None]:
# df.tail()

In [None]:
# df.head(15)

### `df.columns` 
will give us a list of all the column names in our data frame

In [None]:
# df.columns

### `df.dtypes` 
is going to tell us how the computer is interpreting our data (for instance, as a string, integer, float, et cetera). Please note that in Pandas, a column reported as "object" usually holds text, so for our purposes it is the same as a "string" in Python.
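A small sketch with hypothetical columns, showing one integer, one float, and one text column. The exact label printed for the text column can depend on the Pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [21, 34],                        # whole numbers
    "spend": [10.5, 3.0],                   # decimals
    "region": ["SouthEast", "NorthEast"],   # text
})

# One dtype per column.
print(toy.dtypes)
```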

In [None]:
# df.dtypes 

Let's drop that "unnamed" column because it's not going to do us any good.
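One way this might look, assuming the column is literally named `Unnamed: 0` (the name Pandas typically assigns to a header-less index column; check `df.columns` for the actual name in our file):

```python
import pandas as pd

# Toy frame; "Unnamed: 0" is an assumed column name, not confirmed by the dataset.
toy = pd.DataFrame({"Unnamed: 0": [0, 1], "age": [21, 34]})

# drop returns a new frame, so we assign the result back.
toy = toy.drop(columns=["Unnamed: 0"])
print(toy.columns.tolist())
```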

In [None]:
# drop unnamed column

### `df.describe()` 
is going to give us the basic summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns of our data frame
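A brief sketch on a toy column (the name `age` is hypothetical). Note the parentheses: `describe` is a method, so it has to be called.

```python
import pandas as pd

toy = pd.DataFrame({"age": [20, 30, 40]})

summary = toy.describe()   # a data frame indexed by statistic name
print(summary)
```

Individual statistics can then be pulled out by label, for example `summary.loc["mean", "age"]`.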

In [None]:
# df.describe()

### `df.count()` 
is going to give us a count of the non-null cells in each column

In [None]:
# df.count()

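If we want to look at a particular column, we can use column indexing. Calling `count()` on the selected column gives its number of non-null cells, while `value_counts()` gives the number of times each distinct value appears: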
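A short sketch of both calls on a single column, using a hypothetical `region` column with one missing value:

```python
import pandas as pd

toy = pd.DataFrame({"region": ["SouthEast", "NorthEast", "SouthEast", None]})

non_null = toy["region"].count()           # missing cells are not counted
per_value = toy["region"].value_counts()   # occurrences of each distinct value
print(non_null)
print(per_value)
```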

In [None]:
# value_counts()

---

# ⭕ **QUESTIONS?**

---

# A Note on Timestamps

Now, looking back at our `df.dtypes` result, we see that our timestamp values are being stored as `object` and not as timestamps, as we'd like. 

Remember, in Python, how your data is being perceived (the dtype) determines what you can do with it. If we want to do any sort of time-series analysis in the future, we're going to need to convert our time data from object to timestamp. So, let's change that using `pd.to_datetime`.
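A minimal sketch of the conversion, assuming a column named `timestamp` (hypothetical; use the actual column name from `df.columns`):

```python
import pandas as pd

# The values start out as plain text.
toy = pd.DataFrame({"timestamp": ["2020-01-15 08:30:00", "2020-02-01 17:45:00"]})

# Convert the column and assign it back to the data frame.
toy["timestamp"] = pd.to_datetime(toy["timestamp"])
print(toy.dtypes)
```

Once converted, date parts become available through the `.dt` accessor, for example `toy["timestamp"].dt.month`.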

In [None]:
# df.dtypes

In [None]:
# to datetime

In [None]:
# df.head() # 

In [None]:
# df.dtypes

---

# ⭕ **QUESTIONS?**

---

# Primary Analysis, Continued

### `df.sample()` 
is going to give us a random row from our data frame

In [None]:
# df.sample()

# Indexing

To select a single column from our data frame, we can use column indexing again. 

In [None]:
# column index

# .loc

To select multiple columns, we can use `.loc` notation. 

_Note that `.loc` notation is used when you're selecting by label (such as column names), whereas `.iloc` is used when you're selecting by integer position. For instance:_
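A sketch of the selection styles side by side, on a toy data frame with hypothetical column names:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [21, 34, 55],
    "region": ["SE", "NE", "SW"],
    "spend": [10.5, 3.0, 7.25],
})

one_column = toy["age"]                    # plain column indexing
by_label = toy.loc[:, ["age", "spend"]]    # all rows, columns chosen by name
by_position = toy.iloc[:, [0, 2]]          # all rows, columns chosen by position
single_cell = toy.iloc[1, 0]               # row position 1, column position 0
```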

In [None]:
# .loc

In [None]:
# .iloc

In [None]:
# .iloc[x,y]

---

# ⭕ **QUESTIONS?**

---

# `.mean()`
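As a quick illustration on a toy column (the name `age` is hypothetical), `.mean()` returns the average of a numeric column:

```python
import pandas as pd

toy = pd.DataFrame({"age": [20, 30, 40]})

average_age = toy["age"].mean()
print(average_age)
```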

In [None]:
# .mean()

# `df.sort_values(by=...)`
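A small sketch, again with a hypothetical `age` column. `sort_values` returns a new, sorted data frame and sorts in ascending order unless told otherwise:

```python
import pandas as pd

toy = pd.DataFrame({"age": [34, 21, 55]})

youngest_first = toy.sort_values(by="age")                   # ascending by default
oldest_first = toy.sort_values(by="age", ascending=False)    # descending
print(youngest_first)
```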

In [None]:
# sort_values(by=)


# Conditional Indexing
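Conditional indexing keeps only the rows for which a condition is true. A sketch on toy data (column names and values are made up):

```python
import pandas as pd

toy = pd.DataFrame({"age": [21, 34, 21], "region": ["SE", "NE", "NE"]})

is_21 = toy["age"] == 21   # boolean Series with one True/False per row
only_21 = toy[is_21]       # keep the rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both = toy[(toy["age"] == 21) & (toy["region"] == "NE")]
print(len(only_21), len(both))
```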

In [None]:
# conditional

---

# ⭕ **QUESTIONS?**

---

# Exercise 1: How many 21 year-olds were served Culinary ads?

In [None]:
# your code here

# Solution

---

# Exercise 2: What is the most common company size in the SouthEast?

In [None]:
# your code here

# Solution

---

# ⭕ **QUESTIONS?**

---

# Working with More JSON

Moving on, let's look at a larger data set from https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam that details the leading causes of death in NYC.

In [None]:
import requests

url = 'http://data.cityofnewyork.us/api/views/jb7j-dtam/rows.json'
results = requests.get(url).json() # reading in the json just as we did with our citibike info last week

Again, we're going to use the requests library to read the JSON from the given URL. You'll note that there are two main fields returned in the JSON: "meta", which holds the metadata, and "data", which holds the data itself. 

In [None]:
# results

In [None]:
# keys

In [None]:
# data

Now we'll create a DataFrame from our JSON again...
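A sketch of this step that avoids the network call by using a hypothetical miniature of the response. Only the top-level "meta" and "data" keys come from the description above; the row values are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for requests.get(url).json()
results = {
    "meta": {},                                           # metadata omitted here
    "data": [["2014", "Cause A"], ["2011", "Cause B"]],   # one list per row
}

print(list(results.keys()))

# Each inner list becomes a row; columns are numbered 0, 1, ... by default.
toy = pd.DataFrame(results["data"])
print(toy)
```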

In [None]:
# create df

# Adding Column Names
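One simple way to name the columns is to assign a list to `df.columns`, with one name per column and in the same order. The names below are hypothetical placeholders:

```python
import pandas as pd

# Toy frame whose columns are still numbered 0 and 1.
toy = pd.DataFrame([["2014", "Cause A"], ["2011", "Cause B"]])

# The list must have exactly as many names as there are columns.
toy.columns = ["year", "cause"]
print(toy.columns.tolist())
```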

# Dropping Data

There's a lot of extraneous information in this dataframe, so we can drop a few of the columns. 

Note that here we are passing in a list of columns that we'd like to drop, and specifying (with `axis='columns'`) that we want to drop the columns themselves. If we said `axis='index'` we would be dropping rows instead. 

Also note that `inplace=True` specifies that instead of returning a new dataframe, we want to modify the current one directly. This means that the smaller dataframe will persist across the rest of our notebook; in other words, the change is permanent. 
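A sketch of such a call on toy data. The column names `sid` and `position` are hypothetical examples of extraneous columns:

```python
import pandas as pd

toy = pd.DataFrame({"year": [2014, 2011], "sid": [1, 2], "position": [0, 0]})

# axis="columns" drops columns; axis="index" would drop rows with those labels.
# inplace=True modifies toy directly, and the call itself returns None.
toy.drop(["sid", "position"], axis="columns", inplace=True)
print(toy.columns.tolist())
```

Without `inplace=True`, the same effect comes from assigning the result back: `toy = toy.drop([...], axis="columns")`.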

In [None]:
# df.drop

It looks like our last three rows are metadata and not actual data, so let's drop those rows as well.
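One possible way to drop the last three rows, sketched on toy data, is to pass their index labels to `drop`:

```python
import pandas as pd

# Six toy rows; pretend the last three are not real data.
toy = pd.DataFrame({"year": ["2014", "2011", "2012", "x", "y", "z"]})

# toy.index[-3:] holds the labels of the last three rows.
toy = toy.drop(toy.index[-3:])
print(len(toy))
```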

In [None]:
# df.drop

It's important to note that we can always rename our columns using a dictionary:
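A sketch of renaming with a dictionary that maps old names to new names (both sets of names here are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"Year": [2014], "Leading Cause": ["Cause A"]})

# Keys are the current column names, values are the new ones.
renaming_dict = {"Year": "year", "Leading Cause": "cause"}
toy = toy.rename(columns=renaming_dict)
print(toy.columns.tolist())
```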

In [None]:
# renaming_dict

---

# ⭕ **QUESTIONS?**

---

# Datatypes

We've spoken a bit about datatypes, and why it's important that our computer is viewing data as we need it to; for instance, a string as a string, an integer as an integer.  

Remember that 'object' is a string in this case...

In [None]:
# df.dtypes

Let's change 'year' to an integer using `pd.to_numeric` so that we can sort by year:
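A minimal sketch on a toy `year` column whose values start out as text:

```python
import pandas as pd

toy = pd.DataFrame({"year": ["2014", "2011", "2012"]})

# Convert the text to numbers, then sort numerically.
toy["year"] = pd.to_numeric(toy["year"])
toy = toy.sort_values(by="year")
print(toy["year"].tolist())
```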

In [None]:
# to_numeric

# Errors

We can also pass the `errors` argument to specify what should happen if we anticipate Pandas is going to object to one of our changes. From the [documentation of to_numeric](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html), we get:

* If ‘raise’, then invalid parsing will raise an exception
* If ‘coerce’, then invalid parsing will be set as NaN
* If ‘ignore’, then invalid parsing will return the input
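A sketch contrasting the default behavior with `errors="coerce"`, using a made-up series that contains one unparseable value:

```python
import pandas as pd

raw = pd.Series(["12", "7", "not available"])

# With the default (errors="raise"), the bad value triggers an exception.
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# With errors="coerce", the bad value becomes NaN instead.
coerced = pd.to_numeric(raw, errors="coerce")
print(raised)
print(coerced)
```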

In [None]:
# coercions

# Categorical Variables

Last but not least, we can also mark some variables as categorical:
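One way to do this is with `astype("category")`, sketched here on a hypothetical `sex` column:

```python
import pandas as pd

toy = pd.DataFrame({"sex": ["F", "M", "F"]})

# Convert the text column to a categorical one.
toy["sex"] = toy["sex"].astype("category")

# The distinct categories are available through the .cat accessor.
print(toy["sex"].cat.categories.tolist())
```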

In [None]:
# categorical variables

---

# ⭕ **QUESTIONS?**

---

# Exercise 3: What was the leading cause of death in 2014?

In [None]:
# your code here

# Solution

---

# Exercise 4: How many different causes of death were recorded in 2011?

In [None]:
# your code here

# Solution

---