# Lesson 01 ... Welcome to pandas

We will start by importing the libraries we need and then play with some data to learn basic Python commands. What data shall we work with? Let us pull down some data on reported criminal incidents.

First we import a library called `pandas`. In the command that follows, `pd` is simply an alias for `pandas`, so we can type `pd` and have all the `pandas` commands at our disposal.

In [21]:
import pandas as pd 
import numpy as np

The crime incident reports data are [available here](https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/12cb3883-56f5-47de-afa5-3b1cf61b257b/download/tmpayw7hysb.csv) and span multiple years, so we may end up working only with 2019 data, but for now we proceed by gathering everything.

In the command below, the key part is `pd.read_csv()`; inside it is the URL of the comma-separated values (CSV) file. Once `pandas` has downloaded the file, we save it in Python under the name `df`.

Note that a data-set loaded this way is called a data-frame (`DataFrame` in `pandas`), hence the conventional name `df`.


In [22]:
df = pd.read_csv("https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/12cb3883-56f5-47de-afa5-3b1cf61b257b/download/tmpayw7hysb.csv")
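
The same call also accepts a local path or a file-like object, so a small in-memory sample is a convenient way to try `pd.read_csv()` without downloading the full file. The sketch below assumes a three-row excerpt whose values are copied from rows that appear in this lesson; the names `sample_csv` and `sample` and the choice of four columns are illustrative only.

```python
import io
import pandas as pd

# A tiny stand-in for the real file (values copied from rows in this lesson).
sample_csv = io.StringIO(
    "INCIDENT_NUMBER,OFFENSE_CODE,DISTRICT,YEAR\n"
    "S97333701,3301,C6,2020\n"
    "S47513131,2647,E18,2020\n"
    "I92102201,3301,E13,2019\n"
)

# read_csv accepts a URL, a local path, or a file-like object such as StringIO.
sample = pd.read_csv(sample_csv)

print(sample)
print(sample.dtypes)  # the type pandas inferred for each column
```

Checking `.dtypes` right after loading is a cheap way to spot columns that were read as text when numbers were expected.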


Let us look at the first 5 rows of data to get a feel for the layout. The command is `.head(5)`

In [23]:
df.head(5)

Unnamed: 0,INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location
0,TESTTEST2,423,,ASSAULT - AGGRAVATED,External,,0,2019-10-16 00:00:00,2019,10,Wednesday,0,,RIVERVIEW DR,,,"(0.00000000, 0.00000000)"
1,S97333701,3301,,VERBAL DISPUTE,C6,915.0,0,2020-07-18 14:34:00,2020,7,Saturday,14,,MARY BOYLE WAY,42.330813,-71.051368,"(42.33081300, -71.05136800)"
2,S47513131,2647,,THREATS TO DO BODILY HARM,E18,530.0,0,2020-06-24 10:15:00,2020,6,Wednesday,10,,READVILLE ST,42.239491,-71.135954,"(42.23949100, -71.13595400)"
3,I92102201,3301,,VERBAL DISPUTE,E13,583.0,0,2019-12-20 03:08:00,2019,12,Friday,3,,DAY ST,42.325122,-71.107779,"(42.32512200, -71.10777900)"
4,I92097173,3115,,INVESTIGATE PERSON,C11,355.0,0,2019-10-23 00:00:00,2019,10,Wednesday,0,,GIBSON ST,42.297555,-71.059709,"(42.29755500, -71.05970900)"


What about the last 10 rows of the data?

In [24]:
df.tail(10)

Unnamed: 0,INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location
515072,102095489,3115,,INVESTIGATE PERSON,E18,520.0,0,2019-11-25 16:30:00,2019,11,Monday,16,,HYDE PARK AVE,42.256215,-71.124019,"(42.25621500, -71.12401900)"
515073,102091671,2647,,THREATS TO DO BODILY HARM,B3,417.0,0,2019-11-12 12:00:00,2019,11,Tuesday,12,,MORA ST,42.282081,-71.073648,"(42.28208100, -71.07364800)"
515074,20224065,3018,,SICK/INJURED/MEDICAL - POLICE,B2,282.0,0,2020-03-19 07:30:00,2020,3,Thursday,7,,WASHINGTON ST,42.353272,-71.173738,"(42.35327200, -71.17373800)"
515075,20202856,2672,,BIOLOGICAL THREATS,B2,282.0,0,2020-03-19 08:30:00,2020,3,Thursday,8,,WARREN ST,42.328234,-71.083289,"(42.32823400, -71.08328900)"
515076,20063425,3114,,INVESTIGATE PROPERTY,A7,21.0,0,2020-09-01 00:00:00,2020,9,Tuesday,0,,PARIS ST,42.374426,-71.035278,"(42.37442600, -71.03527800)"
515077,20062356,3115,,INVESTIGATE PERSON,E18,520.0,0,2020-08-28 18:39:00,2020,8,Friday,18,,HYDE PARK AVE,42.256215,-71.124019,"(42.25621500, -71.12401900)"
515078,20054040,3501,,MISSING PERSON,C11,,0,2020-07-30 15:30:00,2020,7,Thursday,15,,GIBSON ST,42.297555,-71.059709,"(42.29755500, -71.05970900)"
515079,20046400,1501,,WEAPON VIOLATION - CARRY/ POSSESSING/ SALE/ TR...,B2,330.0,0,2020-07-02 01:38:00,2020,7,Thursday,1,,PASADENA RD,42.30576,-71.083771,"(42.30576000, -71.08377100)"
515080,20038446,1501,,WEAPON VIOLATION - CARRY/ POSSESSING/ SALE/ TR...,B2,300.0,0,2020-06-03 01:15:00,2020,6,Wednesday,1,,WASHINGTON ST,42.323807,-71.08915,"(42.32380700, -71.08915000)"
515081,20030892,540,,BURGLARY - COMMERICAL,C11,380.0,0,2020-05-03 00:00:00,2020,5,Sunday,0,,GALLIVAN BLVD,42.2837,-71.047761,"(42.28370000, -71.04776100)"
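
Besides `.head()` and `.tail()`, a couple of attributes give a quick sense of the layout. This is a minimal sketch on a hypothetical four-row frame named `toy` (values copied from rows shown above), not on the full download.

```python
import pandas as pd

# Hypothetical miniature frame; values are copied from rows shown above.
toy = pd.DataFrame({
    "INCIDENT_NUMBER": ["S97333701", "S47513131", "I92102201", "I92097173"],
    "OFFENSE_CODE": [3301, 2647, 3301, 3115],
    "YEAR": [2020, 2020, 2019, 2019],
})

print(toy.shape)          # tuple of (number of rows, number of columns)
print(list(toy.columns))  # column names, in order
print(toy.head(2))        # first two rows
print(toy.tail(1))        # last row
```

Calling `.head()` or `.tail()` with no argument returns five rows.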


Let us look at the contents of the data-frame ... 

| Column Name | Description |
| :-- | :--- |
| `[incident_num] [varchar](20) NOT NULL` | Internal BPD report number |
| `[offense_code] [varchar](25) NULL` | Numerical code of offense description |
| `[Offense_Code_Group_Description] [varchar](80) NULL` | Internal categorization of `[offense_description]` |
| `[Offense_Description] [varchar](80) NULL` | Primary descriptor of incident |
| `[district] [varchar](10) NULL` | District in which the crime was reported |
| `[reporting_area] [varchar](10) NULL` | RA number associated with where the crime was reported from |
| `[shooting] [char](1) NULL` | Indicates that a shooting took place |
| `[occurred_on] [datetime2](7) NULL` | Earliest date and time the incident could have taken place |
| `[UCR_Part] [varchar](25) NULL` | Uniform Crime Reporting Part number (1, 2, 3) |
| `[street] [varchar](50) NULL` | Name of the street where the incident took place |


Offense Codes are [available here](https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/3aeccf51-a231-4555-ba21-74572b4c33d6/download/rmsoffensecodes.xlsx)

We could also look at the offense codes by reading them in as a data-frame. This is an Excel file so we will have to switch to `.read_excel()`


In [25]:
offense_codes = pd.read_excel("https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/3aeccf51-a231-4555-ba21-74572b4c33d6/download/rmsoffensecodes.xlsx")

In [26]:
print(offense_codes)

     CODE                                       NAME
0     612           LARCENY PURSE SNATCH - NO FORCE 
1     613                        LARCENY SHOPLIFTING
2     615    LARCENY THEFT OF MV PARTS & ACCESSORIES
3    1731                                     INCEST
4    3111                  LICENSE PREMISE VIOLATION
..    ...                                        ...
571  1806  DRUGS - CLASS B TRAFFICKING OVER 18 GRAMS
572  1807  DRUGS - CLASS D TRAFFICKING OVER 50 GRAMS
573  1610    HUMAN TRAFFICKING - COMMERCIAL SEX ACTS
574  2010                              HOME INVASION
575  1620  HUMAN TRAFFICKING - INVOLUNTARY SERVITUDE

[576 rows x 2 columns]


The next step is to see how many data points we have and what the minimum, maximum, and average values are. This can be done with `.describe()`.

In [27]:
df.describe()

Unnamed: 0,OFFENSE_CODE,YEAR,MONTH,HOUR,Lat,Long
count,515082.0,515082.0,515082.0,515082.0,485909.0,485909.0
mean,2333.275632,2017.542933,6.634194,13.07917,42.239043,-70.949353
std,1182.489822,1.543329,3.317964,6.347259,1.891645,3.060012
min,111.0,2015.0,1.0,0.0,-1.0,-71.203312
25%,1102.0,2016.0,4.0,9.0,42.296861,-71.097465
50%,3005.0,2018.0,7.0,14.0,42.325029,-71.077723
75%,3201.0,2019.0,9.0,18.0,42.348312,-71.062562
max,3831.0,2020.0,12.0,23.0,42.395042,0.0


By default the command reports values with many decimal places, which we may not want. Values can be rounded to two places with `.round(2)` or to whole numbers with `.round(0)`, as shown below.

In [28]:
df.describe().round(2)

Unnamed: 0,OFFENSE_CODE,YEAR,MONTH,HOUR,Lat,Long
count,515082.0,515082.0,515082.0,515082.0,485909.0,485909.0
mean,2333.28,2017.54,6.63,13.08,42.24,-70.95
std,1182.49,1.54,3.32,6.35,1.89,3.06
min,111.0,2015.0,1.0,0.0,-1.0,-71.2
25%,1102.0,2016.0,4.0,9.0,42.3,-71.1
50%,3005.0,2018.0,7.0,14.0,42.33,-71.08
75%,3201.0,2019.0,9.0,18.0,42.35,-71.06
max,3831.0,2020.0,12.0,23.0,42.4,0.0


In [29]:
df.describe().round(0)

Unnamed: 0,OFFENSE_CODE,YEAR,MONTH,HOUR,Lat,Long
count,515082.0,515082.0,515082.0,515082.0,485909.0,485909.0
mean,2333.0,2018.0,7.0,13.0,42.0,-71.0
std,1182.0,2.0,3.0,6.0,2.0,3.0
min,111.0,2015.0,1.0,0.0,-1.0,-71.0
25%,1102.0,2016.0,4.0,9.0,42.0,-71.0
50%,3005.0,2018.0,7.0,14.0,42.0,-71.0
75%,3201.0,2019.0,9.0,18.0,42.0,-71.0
max,3831.0,2020.0,12.0,23.0,42.0,0.0
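
Rounding to zero places still prints a trailing `.0` because the values remain floating-point numbers. One way to drop the decimals entirely is to convert to integers after rounding. The sketch below assumes a small hypothetical frame called `toy`; it is an illustration, not part of the original lesson.

```python
import pandas as pd

# Hypothetical miniature frame, for illustration only.
toy = pd.DataFrame({"HOUR": [0, 14, 10, 3], "MONTH": [10, 7, 6, 12]})

summary = toy.describe()

rounded = summary.round(0)            # values are still floats
whole = summary.round(0).astype(int)  # values are now integers

print(rounded)
print(whole)
```

The conversion only works when the summary has no missing values, so it is worth looking at the rounded table first.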


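Note a few things here.

* We have a total of 515082 incidents logged, but the latitude and longitude are available for only 485909 of them.
* The minimum `Lat` of -1.0 and the maximum `Long` of 0.0 are not locations in Boston, so some coordinates appear to be placeholder values rather than real positions.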
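
The gap between the two counts is the number of rows with no latitude. `.describe()` and `.count()` count only non-missing values, while `.isna().sum()` counts the missing ones directly. A minimal sketch, assuming a hypothetical three-row frame `toy` in which one row lacks coordinates:

```python
import pandas as pd

# Hypothetical miniature frame; None marks a missing coordinate.
toy = pd.DataFrame({
    "STREET": ["MARY BOYLE WAY", "RIVERVIEW DR", "DAY ST"],
    "Lat": [42.330813, None, 42.325122],
    "Long": [-71.051368, None, -71.107779],
})

missing = toy[["Lat", "Long"]].isna().sum()  # missing values per column
present = toy[["Lat", "Long"]].count()       # non-missing values per column

print(missing)
print(present)

# The same arithmetic with the counts reported by describe() above:
rows_without_lat = 515082 - 485909
print(rows_without_lat)
```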


Say we want to restrict the dataframe just to 2020. How can we do that?

In [30]:
df20 = df[ df['YEAR'] == 2020 ]

Notice the pattern here: `dataframe[ dataframe['column-name'] == somevalue ]`. Pay attention to the double equal sign `==`, which tests for equality; a single `=` would instead assign a value.
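
It can help to look at the two steps separately: the comparison produces a column of `True`/`False` values, and the outer brackets keep the rows marked `True`. The sketch below uses a hypothetical four-row frame `toy` (values copied from rows shown earlier) and also shows how two conditions might be combined.

```python
import pandas as pd

# Hypothetical miniature frame; values are copied from rows shown earlier.
toy = pd.DataFrame({
    "OFFENSE_CODE": [3301, 2647, 3301, 3115],
    "YEAR": [2020, 2020, 2019, 2019],
})

mask = toy["YEAR"] == 2020   # a True/False value for every row
print(mask)

only_2020 = toy[mask]        # keep the rows where the mask is True
print(only_2020)

# Conditions combine with & (and) or | (or); each needs its own parentheses.
verbal_2020 = toy[(toy["YEAR"] == 2020) & (toy["OFFENSE_CODE"] == 3301)]
print(verbal_2020)
```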

In [31]:
df20.describe()

Unnamed: 0,OFFENSE_CODE,YEAR,MONTH,HOUR,Lat,Long
count,63733.0,63733.0,63733.0,63733.0,62200.0,62200.0
mean,2353.137323,2020.0,4.900554,12.923525,42.319872,-71.084193
std,1182.670996,0.0,2.561463,6.566899,0.032339,0.030578
min,111.0,2020.0,1.0,0.0,42.181845,-71.203312
25%,1001.0,2020.0,3.0,9.0,42.295353,-71.098579
50%,3005.0,2020.0,5.0,14.0,42.321918,-71.078444
75%,3207.0,2020.0,7.0,18.0,42.344561,-71.062
max,3831.0,2020.0,9.0,23.0,42.395041,-70.953726


At this point we might be curious to know which types of offenses are reported most often. Before we do that, however, let us see how many unique values of `OFFENSE_CODE` there are.

In [32]:
df20['OFFENSE_CODE'].nunique()

130

In [33]:
df20['OFFENSE_CODE'].value_counts()

3301    6234
3115    5494
801     3908
3005    3227
3831    2700
        ... 
1603       2
3203       1
2672       1
990        1
2628       1
Name: OFFENSE_CODE, Length: 130, dtype: int64

So code 3301 leads with 6234 reports in 2020, followed by code 3115, then 801, then 3005, and then 3831. Code 3005 is missing from the offense codes list, so we have no idea what it is! That is a crime in itself.
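
One way to check which codes have a name is to compare the counts against the lookup table. The sketch below is an illustration under stated assumptions: `counts` holds three of the 2020 counts reported above, and `lookup` is a hypothetical two-row stand-in for the real 576-row table, with names taken from the `OFFENSE_DESCRIPTION` column shown earlier.

```python
import pandas as pd

# Three of the top codes and their 2020 counts, as reported above.
counts = pd.DataFrame({
    "CODE": [3301, 3115, 3005],
    "count": [6234, 5494, 3227],
})

# Hypothetical two-row lookup table; the real one has 576 rows.
lookup = pd.DataFrame({
    "CODE": [3301, 3115],
    "NAME": ["VERBAL DISPUTE", "INVESTIGATE PERSON"],
})

# Is each code present in the lookup table?
found = counts["CODE"].isin(lookup["CODE"])
print(found)

# A left merge keeps every row of counts; codes with no match get NaN for NAME.
labelled = counts.merge(lookup, on="CODE", how="left")
print(labelled)
```

With the real data, the same merge could presumably be run on the `offense_codes` data-frame, matching `OFFENSE_CODE` to its `CODE` column.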

In [34]:
# Just another way to accomplish the same thing but in a more complicated way.

df20.groupby('OFFENSE_CODE')['OFFENSE_CODE'].count().reset_index(name='count').sort_values(['count'], ascending = False) 

Unnamed: 0,OFFENSE_CODE,count
109,3301,6234
95,3115,5494
23,801,3908
82,3005,3227
129,3831,2700
...,...,...
86,3016,2
26,990,1
106,3203,1
76,2672,1
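
A slightly shorter `groupby` route is `.size()`, which counts the rows in each group. The sketch below, on a hypothetical six-row frame `toy`, shows that it gives the same tallies as `.value_counts()`.

```python
import pandas as pd

# Hypothetical miniature column of offense codes.
toy = pd.DataFrame({"OFFENSE_CODE": [3301, 3115, 3301, 801, 3301, 3115]})

by_value_counts = toy["OFFENSE_CODE"].value_counts()

by_groupby = (
    toy.groupby("OFFENSE_CODE")
       .size()                        # number of rows in each group
       .sort_values(ascending=False)  # largest group first
)

print(by_value_counts)
print(by_groupby)
```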


Code 3301 is VERBAL DISPUTE, as the `OFFENSE_DESCRIPTION` column in the rows shown earlier indicates. Let us focus on these verbal disputes by creating a new dataframe that holds only `OFFENSE_CODE` 3301.

In [35]:
dfverbal = df20[ df20['OFFENSE_CODE'] == 3301 ]

In [36]:
# Now we see this dataframe just to check

dfverbal

Unnamed: 0,INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location
1,S97333701,3301,,VERBAL DISPUTE,C6,915,0,2020-07-18 14:34:00,2020,7,Saturday,14,,MARY BOYLE WAY,42.330813,-71.051368,"(42.33081300, -71.05136800)"
13,I20205924,3301,,VERBAL DISPUTE,E5,822,0,2020-07-19 03:10:00,2020,7,Sunday,3,,EDGEMERE RD,42.258676,-71.151164,"(42.25867600, -71.15116400)"
305302,202062797,3301,,VERBAL DISPUTE,E5,,0,2020-08-30 11:08:00,2020,8,Sunday,11,,,42.293540,-71.122772,"(42.29354000, -71.12277200)"
306053,202053448,3301,,VERBAL DISPUTE,D4,623,0,2020-07-28 13:32:00,2020,7,Tuesday,13,,COMMONWEALTH AVE,,,"(0.00000000, 0.00000000)"
426898,212014631,3301,,VERBAL DISPUTE,B3,943,0,2020-02-23 06:52:00,2020,2,Sunday,6,,WOODBOLE AVE,42.277355,-71.079358,"(42.27735500, -71.07935800)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
491035,202000014,3301,,VERBAL DISPUTE,D4,164,0,2020-01-01 00:30:00,2020,1,Wednesday,0,,HARRISON AVE,42.341362,-71.067076,"(42.34136200, -71.06707600)"
491069,200359470,3301,,VERBAL DISPUTE,B2,316,0,2020-07-16 22:52:00,2020,7,Thursday,22,,HOMESTEAD ST,42.312421,-71.091549,"(42.31242100, -71.09154900)"
491101,200204118,3301,,VERBAL DISPUTE,A7,824,0,2020-04-28 19:10:00,2020,4,Tuesday,19,,TRENTON ST,42.379684,-71.034286,"(42.37968400, -71.03428600)"
491108,200157167,3301,,VERBAL DISPUTE,E13,572,0,2020-03-28 09:18:00,2020,3,Saturday,9,,COLUMBUS AVE,42.313627,-71.095603,"(42.31362700, -71.09560300)"


Which days of the week have more verbal disputes?

In [37]:
dfverbal['DAY_OF_WEEK'].value_counts()

Friday       944
Saturday     939
Sunday       918
Thursday     878
Tuesday      866
Wednesday    857
Monday       832
Name: DAY_OF_WEEK, dtype: int64
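
Because `.value_counts()` sorts by frequency, the days come out of calendar order. The sketch below rebuilds the counts above as a Series (the names `by_day` and `week_order` are illustrative), puts them in weekday order with `.reindex()`, and expresses each day as a percentage of the total.

```python
import pandas as pd

# Counts copied from the output above.
by_day = pd.Series({
    "Friday": 944, "Saturday": 939, "Sunday": 918, "Thursday": 878,
    "Tuesday": 866, "Wednesday": 857, "Monday": 832,
})

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

in_order = by_day.reindex(week_order)               # calendar order
share = (in_order / in_order.sum() * 100).round(1)  # percent of all disputes

print(in_order)
print(share)
```

The seven counts add up to 6234, the number of code 3301 reports found earlier, and the shares are fairly even across the week.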

Which hours and which streets have the most verbal disputes?

In [38]:
dfverbal['HOUR'].value_counts()

21    418
20    401
18    381
0     365
19    364
22    352
16    329
17    324
23    321
11    307
12    295
15    287
14    272
13    270
10    260
1     242
9     217
2     182
8     150
7     130
3     122
4      97
6      81
5      67
Name: HOUR, dtype: int64
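
To read the counts as a timeline rather than a ranking, the result can be sorted by its index (the hour) instead of by count. A minimal sketch on a hypothetical seven-value Series named `hours`:

```python
import pandas as pd

# Hypothetical miniature column of hours (0-23).
hours = pd.Series([21, 0, 21, 14, 0, 21, 9], name="HOUR")

by_count = hours.value_counts()              # most frequent hour first
by_hour = hours.value_counts().sort_index()  # ordered from hour 0 upward

print(by_count)
print(by_hour)
```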

In [39]:
dfverbal['STREET'].value_counts()

WASHINGTON ST     209
BLUE HILL AVE     113
COLUMBIA RD        91
CENTRE ST          84
DORCHESTER AVE     81
                 ... 
US-20               1
FLAHERTY WAY        1
DEVER ST            1
FAIRBANKS ST        1
DELL AVE            1
Name: STREET, Length: 1405, dtype: int64
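
With 1405 distinct streets, it is often enough to look at the top few. Note also that `.value_counts()` leaves out missing values unless asked otherwise, and some rows in this data have no street recorded. The sketch below assumes a hypothetical seven-value Series named `streets`.

```python
import pandas as pd

# Hypothetical miniature column; None stands for a row with no street recorded.
streets = pd.Series(
    ["WASHINGTON ST", "BLUE HILL AVE", None, "WASHINGTON ST",
     "CENTRE ST", "WASHINGTON ST", "BLUE HILL AVE"],
    name="STREET",
)

top_two = streets.value_counts().head(2)           # the two most common streets
with_missing = streets.value_counts(dropna=False)  # also counts missing values

print(top_two)
print(with_missing)
```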

What districts are the worst?

In [40]:
dfverbal['DISTRICT'].value_counts()

C11         1250
B3          1231
B2          1208
E18          476
D4           427
D14          296
E5           285
C6           275
E13          273
A7           265
A1           135
A15          103
External       3
Name: DISTRICT, dtype: int64

Look up districts C11, B3, and B2 ... what areas are these?

# Practice Task 01

Pick another data-set from `data.boston.gov` and go through the same commands, picking some interesting element of the dataframe to explore.