# Introduction to Data Analysis in Python

## Part II - Data analysis using the `pandas` library

### Instructor: Fred Feng

<hr style="height:2px; border:none; color:black; background-color:black;">

### Pandas is a flexible and powerful Python library for data analysis (https://pandas.pydata.org/).

To use pandas, we first need to import the `pandas` library.
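
A minimal sketch of the import, using the conventional `pd` alias. Printing the version is just a quick check that the library loaded.

```python
import pandas as pd  # "pd" is the widely used short alias for pandas

# Show which pandas version is installed
print(pd.__version__)
```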

### Loading a data file

#### Let's use a data set of police-reported road crashes involving pedestrians in Wayne County, MI (2010-2018).

Data source: 
https://www.michigantrafficcrashfacts.org/querytool/table/2#q1;2;2018;o82;0,37:1|2,3:1&p0,4:0,2:0,3:0,5:0,6:0,8:0,54:0,31:0,24:0,76:0,16:0,49:2,3:2,9:2,10||0|1000

#### The data file is `ped_crashes.csv`

First let's use the pandas `read_csv` function to read in the data set from the CSV (Comma-Separated Values) file.
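
As a rough sketch, reading the file could look like the code below. So that the example runs without the real file, it reads a tiny in-memory CSV whose column names and rows are made up for illustration. With the actual data, the call would be `pd.read_csv('ped_crashes.csv')`.

```python
import io
import pandas as pd

# Stand-in for ped_crashes.csv; these column names and rows are hypothetical.
csv_text = """Crash Year,City or Township,Worst Injury in Crash
2018,Detroit,Fatal injury (K)
2017,Dearborn,No injury (O)
"""

# With the real file: crash = pd.read_csv('ped_crashes.csv')
crash = pd.read_csv(io.StringIO(csv_text))
print(crash)
```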




We can take a quick look at the loaded data (called a "dataframe") using the pandas methods `.head()` and `.tail()`.

They show the first or last few rows of the dataframe (five rows by default).
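
A small sketch on a toy dataframe (the values are made up, not taken from the crash data):

```python
import pandas as pd

# Toy dataframe standing in for the crash data
crash = pd.DataFrame({
    'year': list(range(2010, 2019)),
    'city': ['Detroit'] * 9,
})

first_rows = crash.head()   # first 5 rows by default
last_rows = crash.tail(3)   # pass a number to change how many rows are shown

print(first_rows)
print(last_rows)
```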

Pandas `.shape` attribute returns the dataframe's dimensions as a tuple: (number of rows, number of columns).
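
For instance, on a toy dataframe (made-up values), `.shape` can be unpacked into two variables. Note that it is an attribute, so there are no parentheses.

```python
import pandas as pd

# Toy dataframe with 3 rows and 2 columns
crash = pd.DataFrame({
    'year': [2016, 2017, 2018],
    'city': ['Detroit', 'Dearborn', 'Livonia'],
})

n_rows, n_cols = crash.shape  # attribute, not a method call
print(n_rows, n_cols)
```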

### Checking data types

The data set contains both text and numbers. We need to check that pandas read each column in as the type we expect.

Pandas `.dtypes` attribute returns the data type of each column. 
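
A short sketch on a toy dataframe with one integer, one text, and one floating-point column (all values are made up):

```python
import pandas as pd

# Toy dataframe mixing numbers and text
crash = pd.DataFrame({
    'year': [2018, 2017],
    'city': ['Detroit', 'Dearborn'],
    'speedlimit': [25.0, 35.0],
})

# One data type per column
print(crash.dtypes)
```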

Pandas `.columns` attribute returns the column headers.
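
As an illustration, with hypothetical column names that contain spaces:

```python
import pandas as pd

# Toy dataframe; the column names are hypothetical
crash = pd.DataFrame({
    'Crash Year': [2018, 2017],
    'City or Township': ['Detroit', 'Dearborn'],
})

print(crash.columns)

# Convert to a plain Python list if that is easier to work with
column_names = list(crash.columns)
```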

### Renaming column names
The original column names have spaces (and are long-ish), which may make the dataframe a little harder to work with. Let's rename them.

To rename all the columns at once, we can assign a list of new column names to the dataframe's `columns` attribute.
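
A minimal sketch, assuming a toy dataframe with two hypothetical original column names. The list must contain exactly one new name per column, in the same order as the existing columns.

```python
import pandas as pd

# Toy dataframe; the original column names are hypothetical
crash = pd.DataFrame({
    'Crash Year': [2018, 2017],
    'City or Township': ['Detroit', 'Dearborn'],
})

# One new name per existing column, in the same order
crash.columns = ['year', 'city']
print(crash.columns)
```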

If we want to rename only one or a few of the columns, it's handy to use the `.rename` method. 
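
A sketch of `.rename` on a toy dataframe (the original column names are hypothetical). The `columns` argument takes a dictionary that maps old names to new names. The method returns a new dataframe, so the result is assigned back.

```python
import pandas as pd

# Toy dataframe; the original column names are hypothetical
crash = pd.DataFrame({
    'Crash Year': [2018, 2017],
    'City or Township': ['Detroit', 'Dearborn'],
})

# Only the columns listed in the dictionary are renamed
crash = crash.rename(columns={'City or Township': 'city'})
print(crash.columns)
```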

New column names: `year`, `month`, `day`, `timeofday`, `dayofweek`, `city`, `intersection`, `hitandrun`, `lighting`, `weather`, `speedlimit`, `worstinjury`, `partytype`, `age`, `gender`


### Checking missing values
In many real-world datasets you'll find missing values. These can interfere with data analysis processes, so we check for these first.
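
One common way to do this check is to combine `.isna()` with `.sum()`, which counts the missing values in each column. The sketch below uses a toy dataframe with made-up values.

```python
import pandas as pd

# Toy dataframe with one missing value in each column
crash = pd.DataFrame({
    'age': [34, None, 51],
    'gender': ['M', 'F', None],
})

# isna() marks missing cells as True; sum() counts the True values per column
missing = crash.isna().sum()
print(missing)
```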

### Subsetting columns
To select a single column we can use something like either `crash.city` or `crash['city']`.

To select multiple columns we can use something like `crash[['city', 'lighting', 'speedlimit']]`
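
A sketch of both forms on a toy dataframe (made-up values). Selecting one column gives a Series, while a list of column names gives a smaller dataframe. The dot form only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute.

```python
import pandas as pd

# Toy dataframe standing in for the crash data
crash = pd.DataFrame({
    'city': ['Detroit', 'Dearborn'],
    'lighting': ['Daylight', 'Dark'],
    'speedlimit': [25, 35],
})

one_column = crash['city']                   # a Series
same_column = crash.city                     # dot notation, same result
two_columns = crash[['city', 'speedlimit']]  # a dataframe (note the double brackets)
```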

### Value counts
Pandas `value_counts()` method gives us the number of times each value shows up in a given column (a single column is called a pandas Series).
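
For example, on a toy column (made-up values):

```python
import pandas as pd

# Toy dataframe standing in for the crash data
crash = pd.DataFrame({'city': ['Detroit', 'Detroit', 'Dearborn']})

# Count how often each city appears, most frequent first
counts = crash['city'].value_counts()
print(counts)
```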

### Number of unique values
Pandas `nunique()` method gives us the number of unique values in a column.
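
A short sketch on the same kind of toy column (made-up values):

```python
import pandas as pd

# Toy dataframe standing in for the crash data
crash = pd.DataFrame({'city': ['Detroit', 'Detroit', 'Dearborn']})

# Number of distinct cities
n_cities = crash['city'].nunique()
print(n_cities)
```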

### Filtering a dataframe
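
One common approach is boolean indexing: build a True/False condition on a column and use it to keep only the matching rows. The sketch below uses a toy dataframe with made-up values. When combining conditions, each one goes in parentheses and they are joined with `&` (and) or `|` (or).

```python
import pandas as pd

# Toy dataframe standing in for the crash data
crash = pd.DataFrame({
    'year': [2016, 2017, 2018, 2018],
    'city': ['Detroit', 'Dearborn', 'Detroit', 'Livonia'],
})

# Keep only the rows where the city is Detroit
detroit = crash[crash['city'] == 'Detroit']

# Combine two conditions with & and parentheses
detroit_2018 = crash[(crash['city'] == 'Detroit') & (crash['year'] == 2018)]
```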

### Creating a new column
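
A new column can be created by assigning to a column name that does not exist yet. The sketch below uses a toy dataframe. The day names and the `is_weekend` column are made up for illustration and may not match how the real data are coded.

```python
import pandas as pd

# Toy dataframe; the day-of-week values are hypothetical
crash = pd.DataFrame({
    'dayofweek': ['Monday', 'Saturday', 'Sunday'],
})

# Assigning to a new name adds the column (True for Saturday or Sunday)
crash['is_weekend'] = crash['dayofweek'].isin(['Saturday', 'Sunday'])
print(crash)
```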

### Deleting a column
Pandas `.drop` method allows us to delete a list of columns by name.
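
A sketch on a toy dataframe (made-up values). `.drop` returns a new dataframe, so the result is assigned back.

```python
import pandas as pd

# Toy dataframe standing in for the crash data
crash = pd.DataFrame({
    'city': ['Detroit', 'Dearborn'],
    'weather': ['Clear', 'Rain'],
    'speedlimit': [25, 35],
})

# Pass the names of the columns to delete as a list
crash = crash.drop(columns=['weather'])
print(crash.columns)
```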

### Accessing a row
Pandas `.iloc` indexer allows us to access specified rows by integer position (starting at 0), using square brackets.
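
For example, on a toy dataframe (made-up values):

```python
import pandas as pd

# Toy dataframe standing in for the crash data
crash = pd.DataFrame({
    'year': [2016, 2017, 2018],
    'city': ['Detroit', 'Dearborn', 'Livonia'],
})

first_row = crash.iloc[0]      # a single row, returned as a Series
middle_rows = crash.iloc[1:3]  # rows at positions 1 and 2 (the end is excluded)
```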

### Saving a dataframe to file

Pandas `to_csv` method writes the dataframe to a CSV file. 
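
A minimal sketch on a toy dataframe. The output file name is hypothetical, and the file is written to a temporary folder only so that the example cleans up after itself. Passing `index=False` leaves the row index out of the file.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for the crash data
crash = pd.DataFrame({
    'year': [2018, 2017],
    'city': ['Detroit', 'Dearborn'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output file name
    out_path = os.path.join(tmp_dir, 'ped_crashes_clean.csv')

    # index=False avoids writing the row index as an extra column
    crash.to_csv(out_path, index=False)

    # Read the file back to confirm what was written
    reloaded = pd.read_csv(out_path)

print(reloaded)
```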

## Exercises

### 1. Rename some of the columns in the `crash` dataframe (e.g., from "speedlimit" to "Speed_Limit").

### Once done, check the dataframe to make sure they are indeed renamed.

### 2. Select any 3 columns of your choice, and save them as a new dataframe called "crash_new".

### Once done, show the first few rows of the new dataframe.

### 3. What percentage of the crashes are fatal?

### Further resources on pandas

* Search YouTube for "python pandas": https://www.youtube.com/results?search_query=python+pandas


* Data Wrangling with pandas Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf


* Pandas official documentation: https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html