# Data Analysis with Python and Pandas 

---

In [None]:
import pandas as pd # importing the Pandas library

---

# Data Upload

The first step of data analysis is getting your data into the right place. Remember that using Colab is like borrowing someone else's computer, so to upload our CSV files ("accidents.csv" and "SternTech_UserData.csv") we need to:

1. **Download the .zip file from the course repo / NYU Classes to your own computer,** 
2. **In Colab, click on the little arrow on the left-hand side of the screen,**

<div> 
    <img src="attachment:Colab%201.png" width=600 />
</div>

3. **Click on "Files" and then "Upload" to upload the unzipped .csv file,**

<div> 
    <img src="attachment:Colab%202.png" width=600 />
</div>

4. **Select "SternTech_UserData.csv" (again, the unzipped version) and click "open".**

---

### Now that I have my csv in the right place, I can "read it in" using pd.read_csv.

### First we are going to work with 'SternTech_UserData.csv', a dataset of fake data that I created to illustrate the basic tenets of data analysis using Python and Pandas.

### Below, we are setting our dataset equal to the variable 'df' (a commonplace variable name, standing for 'data frame'). 

In [None]:
# READ IN CSV 
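
As a minimal sketch of what this cell could look like: the example below reads a tiny in-memory CSV so that it runs anywhere. The column names (`user_id`, `age`, `region`) are made up for illustration and are not the real columns of 'SternTech_UserData.csv'. In Colab, you would pass the uploaded file name to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Stand-in for the uploaded file: a tiny CSV with hypothetical columns.
# In Colab, this would be: df = pd.read_csv("SternTech_UserData.csv")
sample_csv = io.StringIO(
    ",user_id,age,region\n"
    "0,101,21,SouthEast\n"
    "1,102,34,NorthEast\n"
)

# The header's first field is empty, so Pandas names that column "Unnamed: 0"
df = pd.read_csv(sample_csv)
print(df.shape)
print(df.columns.tolist())
```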

# Primary Analysis of our Data

In [None]:
pd.options.display.max_rows = 2000 # the way Jupyter Notebook tends to display the results of such queries isn't 
                                   # always helpful, but we can very easily change that.
                                   # this will ensure we can view up to 2,000 rows without seeing ellipses in the UI
    
pd.options.display.max_columns = 50 # try commenting out this last line ('max_columns =50') then run the cell below
                                    # to see the difference this formatting makes 

### `df.head()` will give us the first five rows of our data frame 
### `df.tail()` will give us the last five rows 
### `df.head(15)` will give us the first fifteen rows <br>

### `df.columns` will give us a list of all the column names in our data frame

### `df.dtypes` is going to tell us how the computer is interpreting our data (for instance, as a string, integer, float, et cetera). Please note that in Pandas, "object" is, for all intents and purposes, the same as a "string" in Python.
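
The inspection calls above can be sketched on a small stand-in data frame. The columns here are hypothetical; on the real data you would run the same calls on the `df` created by `pd.read_csv`.

```python
import pandas as pd

# Hypothetical stand-in data frame with 20 rows
df = pd.DataFrame({
    "user_id": range(1, 21),
    "age": [20 + i for i in range(20)],
    "region": ["SouthEast", "NorthEast"] * 10,
})

first_five = df.head()        # first five rows
last_five = df.tail()         # last five rows
first_fifteen = df.head(15)   # first fifteen rows

print(df.columns.tolist())    # the column names as a plain list
print(df.dtypes)              # how Pandas is interpreting each column
```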

### Let's drop that "unnamed" column because it's not going to do us any good.

In [None]:
# DROP UNNAMED COLUMN
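
One possible way to do this, assuming the stray column is called "Unnamed: 0" (the name Pandas gives a header field that is empty). The other columns are hypothetical.

```python
import pandas as pd

# Hypothetical data frame with a leftover index column
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "user_id": [101, 102, 103],
    "age": [21, 34, 45],
})

# drop() returns a new data frame, so we assign the result back to df
df = df.drop(columns=["Unnamed: 0"])
print(df.columns.tolist())
```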

### `df.describe()` is going to give us the basic statistical metrics for our data frame

### `df.count()` is going to give us a count of the non-null cells in each column

### If we want to see the count of non-null cells for a particular column, we can use column indexing as such:
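
Here is a short sketch of these three calls on a stand-in data frame with a few missing values. The columns are made up for illustration.

```python
import pandas as pd

# Hypothetical data with some missing cells
df = pd.DataFrame({
    "age": [21, 34, None, 45],
    "region": ["SouthEast", None, None, "West"],
})

summary = df.describe()        # by default, summarizes the numeric columns only
counts = df.count()            # non-null cells in each column
age_count = df["age"].count()  # non-null cells in a single column

print(summary)
print(counts)
print(age_count)
```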

### Now, looking back at our `df.dtypes` result, we see that our timestamp values are being stored as objects and not as timestamps, as we'd like.

### Remember, in Python, how your data is being perceived (the dtype) determines what you can do with it. If we want to do any sort of time-series analysis in the future, we're going to need to convert our time data from object to timestamp. So, let's change that using `pd.to_datetime`

In [None]:
# CONVERT TO TIMESTAMP
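
A minimal sketch of the conversion, assuming the column is called `timestamp` (a hypothetical name) and holds date strings:

```python
import pandas as pd

# Hypothetical column of timestamps stored as plain strings
df = pd.DataFrame({
    "timestamp": ["2019-01-05 08:30:00", "2019-02-10 17:45:00"],
})

# Parse the strings and overwrite the column with real datetime values
df["timestamp"] = pd.to_datetime(df["timestamp"])

print(df.dtypes)
print(df["timestamp"].dt.month)  # the .dt accessor only works on datetime columns
```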

# Primary Analysis of our Data, Continued

### `df.sample()` is going to give us a random row from our data frame

### To select a single column from our data frame, we can use column indexing again. 

### To select multiple columns, we can use `.loc` notation. 

### Note that `.loc` notation is used when you're selecting by label (such as column names), whereas `.iloc` is used when you're selecting by integer position. For instance:
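
The selection techniques above can be sketched as follows. The column names are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in data frame
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": [21, 34, 45, 28],
    "region": ["SouthEast", "NorthEast", "West", "SouthEast"],
})

random_row = df.sample()                    # one random row
ages = df["age"]                            # a single column
subset = df.loc[:, ["user_id", "age"]]      # all rows, columns chosen by name
same_by_position = df.iloc[:, [0, 1]]       # all rows, columns chosen by position
first_two_rows = df.iloc[0:2]               # rows at positions 0 and 1

print(subset)
print(first_two_rows)
```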

### We can get the mean value of a column using `.mean()`

### We can also sort the values in our column using `df.sort_values(by=...)`

### If we want to find any rows where a certain condition holds true, we can use column indexing together with a comparison operator (such as `<`, `>` or `==`). Note that equality is tested with `==`; a single `=` is assignment.
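
A short sketch of the mean, sorting, and filtering steps on a made-up data frame (the columns are assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-in data frame
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": [21, 34, 45, 28],
})

mean_age = df["age"].mean()          # average of the column
by_age = df.sort_values(by="age")    # rows ordered from youngest to oldest

# The comparison builds a True/False mask; indexing with it keeps the True rows
over_30 = df[df["age"] > 30]
exactly_45 = df[df["age"] == 45]

print(mean_age)
print(by_age)
print(over_30)
```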

---

# Exercise 1: How many 21 year-olds were served Culinary ads?

In [None]:
# your code here

---

# Exercise 2: What is the most common company size in the SouthEast?

In [None]:
# your code here

---

# Working with More JSON

### Moving on, let's look at a larger data set from https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam that details the leading causes of death in NYC.

In [None]:
# GET JSON

### Again, we're going to use the requests library to read the JSON from the given URL. You'll note that there are two main fields returned in the JSON: the "meta", which holds the metadata, and the data itself.

In [None]:
# RESULTS KEYS

In [None]:
# DATA

### Now we'll create a DataFrame from our JSON again...

In [None]:
# CREATE DF FROM JSON

### And add some column names

In [None]:
# ADD COLUMN NAMES
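
The steps above can be sketched without a network call by using a hand-made dictionary in place of the parsed response. Everything in it is an assumption for illustration: the "data" key name, the rows, and the column names are invented, and the real response is much larger. In the notebook, the dictionary would come from something like `requests.get(url).json()`.

```python
import pandas as pd

# Hand-made stand-in for the parsed JSON response (hypothetical contents)
results = {
    "meta": {"description": "made-up metadata"},
    "data": [
        ["row-1", "2014", "Heart Disease", "120"],
        ["row-2", "2011", "Influenza", "45"],
    ],
}

print(results.keys())  # the top-level fields

# Each inner list becomes one row; columns are numbered 0, 1, 2, ... at first
df = pd.DataFrame(results["data"])

# Assign readable (hypothetical) names, one per column, in order
df.columns = ["row_id", "Year", "Leading Cause", "Deaths"]
print(df)
```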

### There's a lot of extraneous information in this dataframe, so we can drop a few of the columns.

### Note that here we are passing in a list of columns that we'd like to drop, and specifying that we want to drop the columns themselves. If we said `axis="index"` we would be dropping rows instead.

### Also note that `inplace=True` specifies that instead of returning a new dataframe, we want to modify the current one directly. This means that the smaller dataframe will persist across our entire notebook; in other words, the change is permanent.

In [None]:
# DROP SOME COLUMNS
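
A minimal sketch of dropping columns in place. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical data frame with two columns we do not need
df = pd.DataFrame({
    "row_id": ["a", "b"],
    "created_at": [1, 2],
    "Year": ["2014", "2011"],
    "Deaths": ["120", "45"],
})

# axis="columns" drops columns; inplace=True changes df itself and returns None
df.drop(["row_id", "created_at"], axis="columns", inplace=True)
print(df.columns.tolist())
```

Because `inplace=True` returns `None`, the result should not be assigned back to `df`.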

### It looks like our last three rows are metadata and not actual data, so let's drop those rows as well.

In [None]:
# DROP FINAL THREE ROWS
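
One way to drop the last three rows, shown on made-up data: look up the index labels of the final three rows with `tail(3)` and pass them to `drop`.

```python
import pandas as pd

# Hypothetical data frame whose last three rows are not real records
df = pd.DataFrame({
    "Year": ["2014", "2011", "note", "note", "note"],
    "Deaths": ["120", "45", None, None, None],
})

# df.tail(3).index holds the labels of the last three rows
df = df.drop(df.tail(3).index)
print(df)
```

Slicing by position with `df.iloc[:-3]` would keep the same rows.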

### It's important to note that we can always rename our columns using a dictionary:

In [None]:
# RENAME COLUMNS USING DICTIONARY 
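
A short sketch of renaming with a dictionary that maps old names to new ones. Both sets of names are hypothetical.

```python
import pandas as pd

# Hypothetical data frame
df = pd.DataFrame({
    "Year": ["2014", "2011"],
    "Leading Cause": ["Heart Disease", "Influenza"],
})

# Keys are the current names, values are the new names;
# columns not listed in the dictionary are left unchanged
df = df.rename(columns={"Year": "year", "Leading Cause": "cause"})
print(df.columns.tolist())
```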

### We've spoken a bit about datatypes, and why it's important that our computer is viewing data as we need it to; for instance, a string as a string, an integer as an integer.  

### Remember that 'object' is a string in this case...

In [None]:
# DTYPES

### Let's use `pd.to_numeric` to change 'year' to an integer, so that we can sort by year:

In [None]:
# YEAR TO NUMERIC

### We can also pass the `errors` argument to specify what should happen if we anticipate Pandas is going to object to one of our changes. From the [documentation of to_numeric](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html), we get:

* If ‘raise’, then invalid parsing will raise an exception
* If ‘coerce’, then invalid parsing will be set as NaN
* If ‘ignore’, then invalid parsing will return the input

In [None]:
# ERROR EXAMPLES
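
The sketch below shows the numeric conversion and two of the `errors` options on a made-up series of year strings. It leaves out `'ignore'`, which recent Pandas releases mark as deprecated as far as I know, so it is best checked against the documentation for your installed version.

```python
import pandas as pd

# Hypothetical year values stored as strings, one of them unparseable
years = pd.Series(["2011", "2014", "unknown"])

# errors="coerce" turns anything unparseable into NaN
coerced = pd.to_numeric(years, errors="coerce")

# errors="raise" (the default) stops with a ValueError instead
try:
    pd.to_numeric(years, errors="raise")
    raised = False
except ValueError:
    raised = True

# With clean input, the strings become integers that sort numerically
clean = pd.to_numeric(pd.Series(["2014", "2011"]))
print(coerced)
print(raised)
print(clean.sort_values())
```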

### Last but not least, we can also mark some variables as categorical

In [None]:
# CATEGORICAL SHIFTS
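
A minimal sketch of marking a column as categorical with `astype("category")`. The column name `sex` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column with a small set of repeated values
df = pd.DataFrame({"sex": ["F", "M", "F", "M"]})

# Convert the column from strings to a categorical dtype
df["sex"] = df["sex"].astype("category")

print(df.dtypes)
print(df["sex"].cat.categories.tolist())  # the distinct categories
print(df["sex"].value_counts())
```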

---

# Exercise 3: What was the leading cause of death in 2014?

In [None]:
# your code here

---

# Exercise 4: How many different causes of death were recorded in 2011?

In [None]:
# your code here