# Lesson 01 ... Welcome to pandas

We will start by importing the libraries we need and then play with some data to learn basic Python commands. What data shall we work with? Well, let us pull down some data on reported crime incidents.

First we import a library called `pandas`. In the command that follows, note that `pd` is just an alias for `pandas`, so that we can type `pd` and have all the `pandas` commands at our disposal. We import `numpy` under the alias `np` in the same way.

In [1]:
import pandas as pd 
import numpy as np

The crime incident reports data are [available here](https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/12cb3883-56f5-47de-afa5-3b1cf61b257b/download/tmpayw7hysb.csv) and span multiple years, so we may end up working with only a single year of data, but for now we proceed by gathering everything.

In the command below, the key part is `pd.read_csv()`, and inside it is the URL of the comma-separated values (CSV) file. Once `pandas` has downloaded the file, we save it in Python under the name `df`.

Note that in `pandas` a data-set or data-file is usually held in a `DataFrame` (a data-frame), hence the name `df`.


In [None]:
df = pd.read_csv("https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/12cb3883-56f5-47de-afa5-3b1cf61b257b/download/tmpayw7hysb.csv")
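To see what `pd.read_csv()` does without depending on the network, here is a minimal sketch that reads a few rows from an in-memory string. The column names and values are made up for illustration and are not taken from the Boston data.

```python
import io
import pandas as pd

# A tiny, made-up CSV (hypothetical values, not the real crime data).
csv_text = """INCIDENT_NUMBER,OFFENSE_CODE,YEAR
A001,3301,2020
A002,3115,2020
A003,801,2019
"""

# read_csv accepts a URL, a file path, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # pandas infers a type for each column
```

The same call works with a URL or a local path in place of `io.StringIO(csv_text)`.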

Let us look at the first 5 rows of data to get a feel for the layout. The command is `.head(5)`

In [None]:
df.head(5)

What about the last 10 rows of the data?

In [None]:
df.tail(10)
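As a small sketch of how `.head()` and `.tail()` behave, the example below uses a made-up data-frame of 20 rows (the column name `row` is hypothetical).

```python
import pandas as pd

# 20 made-up rows, numbered 1 to 20.
toy = pd.DataFrame({"row": range(1, 21)})

first_five = toy.head()   # with no argument, head() returns 5 rows
last_ten = toy.tail(10)   # the last 10 rows

print(first_five["row"].tolist())
print(last_ten["row"].tolist())
```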

Let us look at the contents of the data-frame ... 

| Column Name | Type | Null? | Description |
| :-- | :-- | :-- | :--- |
| `incident_num` | `varchar(20)` | NOT NULL | Internal BPD report number |
| `offense_code` | `varchar(25)` | NULL | Numerical code of offense description |
| `Offense_Code_Group_Description` | `varchar(80)` | NULL | Internal categorization of `offense_description` |
| `Offense_Description` | `varchar(80)` | NULL | Primary descriptor of incident |
| `district` | `varchar(10)` | NULL | What district the crime was reported in |
| `reporting_area` | `varchar(10)` | NULL | RA number associated with where the crime was reported from |
| `shooting` | `char(1)` | NULL | Indicates a shooting took place |
| `occurred_on` | `datetime2(7)` | NULL | Earliest date and time the incident could have taken place |
| `UCR_Part` | `varchar(25)` | NULL | Uniform Crime Reporting Part number (1, 2, 3) |
| `street` | `varchar(50)` | NULL | Street name where the incident took place |


Offense codes are [available here](https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/3aeccf51-a231-4555-ba21-74572b4c33d6/download/rmsoffensecodes.xlsx).

We could also look at the offense codes by reading them in as a data-frame. This is an Excel file, so we will have to switch to `pd.read_excel()`.


In [None]:
offense_codes = pd.read_excel("https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/3aeccf51-a231-4555-ba21-74572b4c33d6/download/rmsoffensecodes.xlsx")

In [None]:
print(offense_codes)

The next step is to see how many data points we have and what the minimum, maximum, and average values are, among other summaries. This can be done with `.describe()`.

In [None]:
df.describe()

By default the command reports values with many decimal places, which we may not want. Decimals can be rounded to a chosen number of places or removed altogether, as shown below.

In [None]:
df.describe().round(2)

In [None]:
df.describe().round(0)

Note a few things here. 

* We have a total of 515082 incidents logged, but the latitude and longitude are available for no more than 485909 incidents.
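The differing counts arise because `.describe()` counts only non-missing values in each column. A minimal sketch with made-up numbers (the column names `HOUR` and `Lat` are used here only for illustration):

```python
import numpy as np
import pandas as pd

# Four made-up incidents; two of them have no latitude recorded.
toy = pd.DataFrame({
    "HOUR": [1, 13, 22, 8],
    "Lat": [42.36, np.nan, 42.33, np.nan],
})

summary = toy.describe().round(2)

# The 'count' row reports non-missing values per column,
# so it can differ from one column to the next.
print(summary.loc["count"])
```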


Say we want to restrict the dataframe just to 2020. How can we do that?

In [None]:
df20 = df[ df['YEAR'] == 2020 ]

Notice the pattern here, `dataframe[ dataframe['column-name'] == somevalue ]`, and pay attention to the double equal sign `==`, which tests for equality (a single `=` assigns a value).
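To make the filtering pattern concrete, here is a small sketch on a made-up data-frame. It shows that the inner expression produces one `True`/`False` value per row, and that the outer brackets keep only the `True` rows. The values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "YEAR": [2019, 2020, 2020, 2021],
    "OFFENSE_CODE": [801, 3301, 3115, 3301],
})

# The comparison yields a Series of True/False, one per row.
mask = toy["YEAR"] == 2020
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
toy20 = toy[mask]
print(toy20)

# Conditions combine with & (and) and | (or);
# each condition needs its own parentheses.
both = toy[(toy["YEAR"] == 2020) & (toy["OFFENSE_CODE"] == 3301)]
print(len(both))
```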

In [None]:
df20.describe()

At this point we might be curious to know which types of offenses are reported most often. Before we do that, however, let us also see how many unique values of `OFFENSE_CODE` there are.

In [None]:
df20['OFFENSE_CODE'].nunique()

In [None]:
df20['OFFENSE_CODE'].value_counts()

So code 3301 leads with 6234 reports in 2020, followed by code 3115, then 801, then 3005, and then 3831. Code 3005 is missing from their list, so we have no idea what it is! That is a crime in itself.
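One way to find codes that have no entry in the lookup list is a left merge followed by a check for missing names. The sketch below uses made-up counts and a hypothetical lookup table; the column names `CODE` and `NAME` and the placeholder descriptions are assumptions, and the real Excel file may be laid out differently.

```python
import pandas as pd

# Made-up counts, standing in for df20['OFFENSE_CODE'].value_counts().
counts = pd.DataFrame({
    "OFFENSE_CODE": [3301, 3115, 3005],
    "count": [30, 20, 10],
})

# Hypothetical lookup table with placeholder descriptions.
lookup = pd.DataFrame({
    "CODE": [3301, 3115],
    "NAME": ["description A", "description B"],
})

# A left merge keeps every row of `counts`;
# codes with no match in `lookup` get a missing NAME.
labelled = counts.merge(lookup, how="left", left_on="OFFENSE_CODE", right_on="CODE")

missing = labelled.loc[labelled["NAME"].isna(), "OFFENSE_CODE"].tolist()
print(missing)
```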

In [None]:
# Just another way to accomplish the same thing but in a more complicated way.

df20.groupby('OFFENSE_CODE')['OFFENSE_CODE'].count().reset_index(name='count').sort_values(['count'], ascending = False) 
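To check that the two approaches really agree, here is a small sketch on made-up codes. The short form returns a Series indexed by code, while the long form returns a data-frame with a `count` column.

```python
import pandas as pd

toy = pd.DataFrame({"OFFENSE_CODE": [3301, 801, 3301, 3115, 3301, 801]})

# Short form: counts per code, largest first.
short = toy["OFFENSE_CODE"].value_counts()

# Long form: group, count, turn the result into a data-frame, then sort.
long = (
    toy.groupby("OFFENSE_CODE")["OFFENSE_CODE"]
    .count()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)

print(short)
print(long)
```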

Code 3301 is the code for verbal disputes, so let us focus on those. We will do so by creating a new dataframe that contains only `OFFENSE_CODE` 3301.

In [None]:
dfverbal = df20[ df20['OFFENSE_CODE'] == 3301 ]

In [None]:
# Now we see this dataframe just to check

dfverbal

Which days of the week have more verbal disputes?

In [None]:
dfverbal['DAY_OF_WEEK'].value_counts()
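Since `.value_counts()` sorts by frequency, the days come out in order of count rather than calendar order. A minimal sketch of putting them back in calendar order, assuming the days are stored as full English names (an assumption about the data) and using made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "DAY_OF_WEEK": ["Friday", "Monday", "Friday", "Sunday", "Monday", "Friday"],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

by_count = toy["DAY_OF_WEEK"].value_counts()

# reindex() rearranges the counts to follow day_order;
# days that never occur get a count of 0.
by_calendar = by_count.reindex(day_order, fill_value=0)
print(by_calendar)
```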

Which hours and which streets have the most verbal disputes?

In [None]:
dfverbal['HOUR'].value_counts()

In [None]:
dfverbal['STREET'].value_counts()

What districts are the worst?

In [None]:
dfverbal['DISTRICT'].value_counts()

Look up districts C11, B3, and B2 ... what areas are these?

# Practice Task 01

Pick another data-set from `data.boston.gov` and go through the same commands, picking some interesting element of the dataframe to explore

## What if we have local data, sitting in a folder called `data`?

In [2]:
bm15 = pd.read_csv("data/marathon_results_2015.csv")
bm15.head()

Unnamed: 0.1,Unnamed: 0,Bib,Name,Age,M/F,City,State,Country,Citizen,Unnamed: 9,...,25K,30K,35K,40K,Pace,Proj Time,Official Time,Overall,Gender,Division
0,0,3,"Desisa, Lelisa",25,M,Ambo,,ETH,,,...,1:16:07,1:32:00,1:47:59,2:02:39,0:04:56,-,2:09:17,1,1,1
1,1,4,"Tsegay, Yemane Adhane",30,M,Addis Ababa,,ETH,,,...,1:16:07,1:31:59,1:47:59,2:02:42,0:04:58,-,2:09:48,2,2,2
2,2,8,"Chebet, Wilson",29,M,Marakwet,,KEN,,,...,1:16:07,1:32:00,1:47:59,2:03:01,0:04:59,-,2:10:22,3,3,3
3,3,11,"Kipyego, Bernard",28,M,Eldoret,,KEN,,,...,1:16:07,1:32:00,1:48:03,2:03:47,0:05:00,-,2:10:47,4,4,4
4,4,10,"Korir, Wesley",32,M,Kitale,,KEN,,,...,1:16:07,1:32:00,1:47:59,2:03:27,0:05:00,-,2:10:49,5,5,5


In [3]:
bm16 = pd.read_csv("data/marathon_results_2016.csv")
bm16.head()

Unnamed: 0,Bib,Name,Age,M/F,City,State,Country,Citizen,Unnamed: 8,5K,...,25K,30K,35K,40K,Pace,Proj Time,Official Time,Overall,Gender,Division
0,5,"Hayle, Lemi Berhanu",21,M,Addis Ababa,,ETH,,,0:15:47,...,1:19:15,1:34:17,1:50:24,2:05:59,0:05:04,2:12:45,2:12:45,1,1,1
1,1,"Desisa, Lelisa",26,M,Ambo,,ETH,,,0:15:47,...,1:19:15,1:34:17,1:50:24,2:05:59,0:05:06,2:13:32,2:13:32,2,2,2
2,6,"Tsegay, Yemane Adhane",31,M,Addis Ababa,,ETH,,,0:15:46,...,1:19:15,1:34:45,1:50:48,2:06:47,0:05:07,2:14:02,2:14:02,3,3,3
3,11,"Korir, Wesley",33,M,Kitale,,KEN,,,0:15:46,...,1:19:16,1:34:45,1:50:48,2:06:47,0:05:07,2:14:05,2:14:05,4,4,4
4,14,"Lonyangata, Paul",23,M,Eldoret,,KEN,,,0:15:46,...,1:19:18,1:34:46,1:51:30,2:08:11,0:05:11,2:15:45,2:15:45,5,5,5


In [4]:
bm17 = pd.read_csv("data/marathon_results_2017.csv")
bm17.head()

Unnamed: 0.1,Unnamed: 0,Bib,Name,Age,M/F,City,State,Country,Citizen,Unnamed: 9,...,25K,30K,35K,40K,Pace,Proj Time,Official Time,Overall,Gender,Division
0,0,11,"Kirui, Geoffrey",24,M,Keringet,,KEN,,,...,1:16:59,1:33:01,1:48:19,2:02:53,0:04:57,-,2:09:37,1,1,1
1,1,17,"Rupp, Galen",30,M,Portland,OR,USA,,,...,1:16:59,1:33:01,1:48:19,2:03:14,0:04:58,-,2:09:58,2,2,2
2,2,23,"Osako, Suguru",25,M,Machida-City,,JPN,,,...,1:17:00,1:33:01,1:48:31,2:03:38,0:04:59,-,2:10:28,3,3,3
3,3,21,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,,,...,1:17:00,1:33:01,1:48:58,2:04:35,0:05:03,-,2:12:08,4,4,4
4,4,9,"Chebet, Wilson",31,M,Marakwet,,KEN,,,...,1:16:59,1:33:01,1:48:41,2:05:00,0:05:04,-,2:12:35,5,5,5


Notice the extra first column in `bm15` and `bm17` ... these should be dropped. 
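One common way such a column comes about is a file that was saved together with its row labels. This is an assumption about these particular files, not something the data confirm. The sketch below reproduces the effect on a made-up data-frame and then drops the column.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Bib": [3, 4], "Age": [25, 30]})

# With the default index=True, to_csv() writes the row labels
# as a first column with an empty header.
csv_text = toy.to_csv()

# read_csv() then names that column 'Unnamed: 0'.
back = pd.read_csv(io.StringIO(csv_text))
print(back.columns.tolist())

# drop() returns a new data-frame, so the result has to be assigned.
cleaned = back.drop(columns="Unnamed: 0")
print(cleaned.columns.tolist())
```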

We can then create an indicator for which year the race data are from, and then combine all three data-frames into a single one. When we do this we will write it out as a `csv` file too.

In [5]:
bm15 = bm15.drop(columns = 'Unnamed: 0')
bm15.head()

Unnamed: 0,Bib,Name,Age,M/F,City,State,Country,Citizen,Unnamed: 9,5K,...,25K,30K,35K,40K,Pace,Proj Time,Official Time,Overall,Gender,Division
0,3,"Desisa, Lelisa",25,M,Ambo,,ETH,,,0:14:43,...,1:16:07,1:32:00,1:47:59,2:02:39,0:04:56,-,2:09:17,1,1,1
1,4,"Tsegay, Yemane Adhane",30,M,Addis Ababa,,ETH,,,0:14:43,...,1:16:07,1:31:59,1:47:59,2:02:42,0:04:58,-,2:09:48,2,2,2
2,8,"Chebet, Wilson",29,M,Marakwet,,KEN,,,0:14:43,...,1:16:07,1:32:00,1:47:59,2:03:01,0:04:59,-,2:10:22,3,3,3
3,11,"Kipyego, Bernard",28,M,Eldoret,,KEN,,,0:14:43,...,1:16:07,1:32:00,1:48:03,2:03:47,0:05:00,-,2:10:47,4,4,4
4,10,"Korir, Wesley",32,M,Kitale,,KEN,,,0:14:43,...,1:16:07,1:32:00,1:47:59,2:03:27,0:05:00,-,2:10:49,5,5,5


In [6]:
bm17 = bm17.drop(columns = 'Unnamed: 0')
bm17.head()

Unnamed: 0,Bib,Name,Age,M/F,City,State,Country,Citizen,Unnamed: 9,5K,...,25K,30K,35K,40K,Pace,Proj Time,Official Time,Overall,Gender,Division
0,11,"Kirui, Geoffrey",24,M,Keringet,,KEN,,,0:15:25,...,1:16:59,1:33:01,1:48:19,2:02:53,0:04:57,-,2:09:37,1,1,1
1,17,"Rupp, Galen",30,M,Portland,OR,USA,,,0:15:24,...,1:16:59,1:33:01,1:48:19,2:03:14,0:04:58,-,2:09:58,2,2,2
2,23,"Osako, Suguru",25,M,Machida-City,,JPN,,,0:15:25,...,1:17:00,1:33:01,1:48:31,2:03:38,0:04:59,-,2:10:28,3,3,3
3,21,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,,,0:15:25,...,1:17:00,1:33:01,1:48:58,2:04:35,0:05:03,-,2:12:08,4,4,4
4,9,"Chebet, Wilson",31,M,Marakwet,,KEN,,,0:15:25,...,1:16:59,1:33:01,1:48:41,2:05:00,0:05:04,-,2:12:35,5,5,5


In [7]:
bm15['Year'] = '2015'
bm16['Year'] = '2016'
bm17['Year'] = '2017'

Now we bind the rows, i.e., stack one data-frame on top of the other, with `pd.concat([dfi, dfj, dfk])`.

In [8]:
bm_df = pd.concat([bm15, bm16, bm17])
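A detail worth knowing about `pd.concat()` is that it keeps each data-frame's original row labels, so the labels repeat in the stacked result. The sketch below shows this on two made-up data-frames, along with the `ignore_index=True` option that renumbers the rows.

```python
import pandas as pd

# Two tiny, made-up data-frames with a year indicator each.
a = pd.DataFrame({"Bib": [3, 4]})
b = pd.DataFrame({"Bib": [5, 1]})
a["Year"] = "2015"
b["Year"] = "2016"

# Plain stacking keeps the original row labels, so they repeat.
stacked = pd.concat([a, b])
print(stacked.index.tolist())

# ignore_index=True assigns fresh labels 0, 1, 2, ...
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())
```

Repeated labels are harmless for counting and grouping, but they matter if rows are later selected by label.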

In [9]:
bm_df.shape 

# How many rows and how many columns do we have? 
# This is a good step to check if we stacked the data-frames correctly or not

(79638, 26)
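Writing a combined data-frame out as a `csv` file can be sketched as follows. The data-frame is made up, and the file name `marathon_results_all.csv` and the temporary folder are hypothetical choices for illustration.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"Bib": [3, 5], "Year": ["2015", "2016"]})

# Hypothetical output location; a temporary folder is used
# so that nothing existing is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "marathon_results_all.csv")

# index=False leaves the row labels out of the file.
toy.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
check = pd.read_csv(out_path)
print(check.columns.tolist())
```

With `index=False` the file contains only the data columns, so no extra unnamed column appears when it is read back in.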

### How many runners per year?

In [10]:
bm_df['Year'].value_counts()

2016    26630
2015    26598
2017    26410
Name: Year, dtype: int64

### How many Male/Female runners per year?

In [14]:
bm_df.groupby('M/F')['Year'].value_counts()

M/F  Year
F    2016    12167
     2015    12017
     2017    11972
M    2015    14581
     2016    14463
     2017    14438
Name: Year, dtype: int64

In [20]:
# The proportion of Male/Female runners, by Year

bm_df.groupby('Year')['M/F'].value_counts(normalize = True)

Year  M/F
2015  M      0.548199
      F      0.451801
2016  M      0.543109
      F      0.456891
2017  M      0.546687
      F      0.453313
Name: M/F, dtype: float64
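The same proportions can also be laid out as a table with one row per year. The sketch below compares the `groupby` approach with `pd.crosstab()` on made-up data; `normalize="index"` makes each row sum to 1.

```python
import pandas as pd

# Made-up runners: 3 in 2015 and 4 in 2016.
toy = pd.DataFrame({
    "Year": ["2015", "2015", "2015", "2016", "2016", "2016", "2016"],
    "M/F": ["M", "M", "F", "M", "F", "F", "F"],
})

# Proportions within each year, as a Series with a two-level index.
shares = toy.groupby("Year")["M/F"].value_counts(normalize=True)
print(shares)

# The same proportions as a Year-by-M/F table.
table = pd.crosstab(toy["Year"], toy["M/F"], normalize="index")
print(table.round(3))
```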