# Data Science Python Toolkit

<font color = 'blue'>Importing Functions and Libraries</font>

The first thing that we'll need to do in order to work with our data is to import
the operating systems and additional libraries that we'll need to use
if our code is like a recipe, these imports are like the cooking tools and equipment
we'll use: 

In [1]:
import os
import pandas as pd
from pandas import ExcelWriter #if we're importing a file from Excel
from pandas import ExcelFile #if we're importing a file from Excel
import numpy as np

for now, and 

In [2]:
import matplotlib.pyplot as plt
from ipywidgets import *
%matplotlib inline

when we start working with plots/graphs

<font color = 'blue'>Importing Datasets</font>

Next, we'll want to import our data as a dataframe.
This will probably be in an Excel or CSV (comma separated values) format, which we can import from an Excel file with 

df = pd.read_excel(<font color = 'red'>'file path name'</font>, sheetname = <font color = 'red'>'Sheet, if you want to grab info from any sheet other than the first one (default)'</font>). 

Or, in a CSV file with:

In [3]:
df = pd.read_csv('ECB_Citations.csv')
# df = pd.read_csv("C:\\Users\\melanie.shimano\\Documents\\ECB_Citations.csv")
#download the ECB Citation dataset and copy that file's path name in order to follow along with examples in this notebook

When you copy the file path name, you'll need to either add an extra '\' whenever one appears in the file path name, replace the '\'s with '/'s, or add an r before the quote in your file name pathway. This is because a backwards slash in python has another function.

You can get the file path name by holding shift while right-clicking the document, then chosing "Copy Path" and pasting that information into your code in ''

<font color = 'blue'>Indexing Datasets</font>

Next, we'll want to index our dataframe so that we can label our rows and use this as a row identifier.
You can set your index to a column name to help identify what you're looking for with:

In [4]:
df.set_index('ViolationDate')

Unnamed: 0_level_0,CitationNo,LienCode,DueDate,Agency,FineAmount,Description,Balance,LastPaidDate,LastPaidAmount,HearingDate,...,OfficerPresenceRequested,HearingStatus,HearTime,TotalPaid,TotalAbated,TotalVoided,Neighborhood,PoliceDistrict,CouncilDistrict,Location
ViolationDate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
08/10/2016,54480041,L,09/09/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,BULK TRASH ...,$0.00,09/27/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,Brooklyn,Southern,10.0,"3420 7TH ST\nBaltimore, MD\n(39.239066, -76.59..."
08/17/2016,54479266,L,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$115.00,,$0.00,10/12/2016,...,Y,GR,0130P,$0.00,$400.00,$0.00,East Baltimore Midway,Eastern,12.0,"1905 SHERWOOD AVE\nBaltimore, MD\n(39.312197, ..."
08/17/2016,54479357,L,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$0.00,10/11/2016,$500.00,,...,,,,$500.00,$0.00,$0.00,Broadway East,Eastern,13.0,"1226 N PATTERSON PARK AVE\nBaltimore, MD\n(39...."
08/19/2016,54479001,L,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,BULK TRASH ...,$0.00,09/09/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,South Baltimore,Southern,11.0,"1714 MARSHALL ST\nBaltimore, MD\n(39.270047, -..."
08/18/2016,54478748,L,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,BULK TRASH ...,$0.00,09/09/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,Arlington,Northwestern,5.0,"5415 NELSON AVE\nBaltimore, MD\n(39.34874, -76..."
08/19/2016,54479787,L,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,TRASH ACCUMULATION ...,$0.00,,$0.00,,...,Y,,,$0.00,$0.00,$50.00,Perring Loch,Notheastern,3.0,"1928 HILLENWOOD ROAD\nBaltimore, MD\n(39.35422..."
08/18/2016,54479498,L,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$100.00,FAILURE TO FILE A COMPLETED ANNUAL REGISTRATIO...,$0.00,09/16/2016,$100.00,,...,,,,$100.00,$0.00,$0.00,Bayview,Southeastern,1.0,"0307 DREW ST\nBaltimore, MD\n(39.290297, -76.5..."
08/18/2016,54479712,L,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,TRASH ACCUMULATION ...,$0.00,08/30/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,Belair-Edison,Notheastern,13.0,"3023 KENTUCKY AVE\nBaltimore, MD\n(39.323356, ..."
08/18/2016,54478896,L,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,HIGH GRASS AND WEEDS ...,$0.00,09/14/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,Belair-Edison,Notheastern,13.0,"2910 EDISON HWY\nBaltimore, MD\n(39.31585, -76..."
08/19/2016,54479043,L,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,HIGH GRASS AND WEEDS ...,$0.00,10/07/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,South Baltimore,Southern,11.0,"1712 MARSHALL ST\nBaltimore, MD\n(39.270079, -..."


If you need to access this data later, you'll be able to reset the index by

In [5]:
df.set_index('CitationNo')

Unnamed: 0_level_0,LienCode,ViolationDate,DueDate,Agency,FineAmount,Description,Balance,LastPaidDate,LastPaidAmount,HearingDate,...,OfficerPresenceRequested,HearingStatus,HearTime,TotalPaid,TotalAbated,TotalVoided,Neighborhood,PoliceDistrict,CouncilDistrict,Location
CitationNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
54480041,L,08/10/2016,09/09/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,BULK TRASH ...,$0.00,09/27/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,Brooklyn,Southern,10.0,"3420 7TH ST\nBaltimore, MD\n(39.239066, -76.59..."
54479266,L,08/17/2016,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$115.00,,$0.00,10/12/2016,...,Y,GR,0130P,$0.00,$400.00,$0.00,East Baltimore Midway,Eastern,12.0,"1905 SHERWOOD AVE\nBaltimore, MD\n(39.312197, ..."
54479357,L,08/17/2016,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$0.00,10/11/2016,$500.00,,...,,,,$500.00,$0.00,$0.00,Broadway East,Eastern,13.0,"1226 N PATTERSON PARK AVE\nBaltimore, MD\n(39...."
54479001,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,BULK TRASH ...,$0.00,09/09/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,South Baltimore,Southern,11.0,"1714 MARSHALL ST\nBaltimore, MD\n(39.270047, -..."
54478748,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,BULK TRASH ...,$0.00,09/09/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,Arlington,Northwestern,5.0,"5415 NELSON AVE\nBaltimore, MD\n(39.34874, -76..."
54479787,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,TRASH ACCUMULATION ...,$0.00,,$0.00,,...,Y,,,$0.00,$0.00,$50.00,Perring Loch,Notheastern,3.0,"1928 HILLENWOOD ROAD\nBaltimore, MD\n(39.35422..."
54479498,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$100.00,FAILURE TO FILE A COMPLETED ANNUAL REGISTRATIO...,$0.00,09/16/2016,$100.00,,...,,,,$100.00,$0.00,$0.00,Bayview,Southeastern,1.0,"0307 DREW ST\nBaltimore, MD\n(39.290297, -76.5..."
54479712,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,TRASH ACCUMULATION ...,$0.00,08/30/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,Belair-Edison,Notheastern,13.0,"3023 KENTUCKY AVE\nBaltimore, MD\n(39.323356, ..."
54478896,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,HIGH GRASS AND WEEDS ...,$0.00,09/14/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,Belair-Edison,Notheastern,13.0,"2910 EDISON HWY\nBaltimore, MD\n(39.31585, -76..."
54479043,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,HIGH GRASS AND WEEDS ...,$0.00,10/07/2016,$50.00,,...,,,,$50.00,$0.00,$0.00,South Baltimore,Southern,11.0,"1712 MARSHALL ST\nBaltimore, MD\n(39.270079, -..."


<font color = 'blue'>Previewing Datasets</font>

If we want to check that we imported the correct dataset, then we can preview our dataframe by looking at the top 5 rows with

In [6]:
df.head()

Unnamed: 0,CitationNo,LienCode,ViolationDate,DueDate,Agency,FineAmount,Description,Balance,LastPaidDate,LastPaidAmount,...,OfficerPresenceRequested,HearingStatus,HearTime,TotalPaid,TotalAbated,TotalVoided,Neighborhood,PoliceDistrict,CouncilDistrict,Location
0,54480041,L,08/10/2016,09/09/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,BULK TRASH ...,$0.00,09/27/2016,$50.00,...,,,,$50.00,$0.00,$0.00,Brooklyn,Southern,10.0,"3420 7TH ST\nBaltimore, MD\n(39.239066, -76.59..."
1,54479266,L,08/17/2016,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$115.00,,$0.00,...,Y,GR,0130P,$0.00,$400.00,$0.00,East Baltimore Midway,Eastern,12.0,"1905 SHERWOOD AVE\nBaltimore, MD\n(39.312197, ..."
2,54479357,L,08/17/2016,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$0.00,10/11/2016,$500.00,...,,,,$500.00,$0.00,$0.00,Broadway East,Eastern,13.0,"1226 N PATTERSON PARK AVE\nBaltimore, MD\n(39...."
3,54479001,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,BULK TRASH ...,$0.00,09/09/2016,$50.00,...,,,,$50.00,$0.00,$0.00,South Baltimore,Southern,11.0,"1714 MARSHALL ST\nBaltimore, MD\n(39.270047, -..."
4,54478748,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,BULK TRASH ...,$0.00,09/09/2016,$50.00,...,,,,$50.00,$0.00,$0.00,Arlington,Northwestern,5.0,"5415 NELSON AVE\nBaltimore, MD\n(39.34874, -76..."


Or the bottom 5 rows with

In [7]:
df.tail()

Unnamed: 0,CitationNo,LienCode,ViolationDate,DueDate,Agency,FineAmount,Description,Balance,LastPaidDate,LastPaidAmount,...,OfficerPresenceRequested,HearingStatus,HearTime,TotalPaid,TotalAbated,TotalVoided,Neighborhood,PoliceDistrict,CouncilDistrict,Location
230841,54479282,L,08/17/2016,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$115.00,,$0.00,...,Y,GR,0130P,$0.00,$400.00,$0.00,East Baltimore Midway,Eastern,12.0,"1909 SHERWOOD AVE\nBaltimore, MD\n(39.312269, ..."
230842,54642012,L,01/11/2017,02/12/2017,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$100.00,FAILURE TO FILE A COMPLETED ANNUAL REGISTRATIO...,$100.00,,$0.00,...,,,,$0.00,$0.00,$0.00,Carrollton Ridge,Southwestern,9.0,"2106 RAMSAY ST\nBaltimore, MD\n(39.28295, -76...."
230843,54728886,L,01/30/2017,03/04/2017,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$500.00,03/27/2017,$500.00,...,,,,$500.00,$0.00,$0.00,Irvington,Southwestern,8.0,"4200 CONNECTICUT AVE\nBaltimore, MD\n(39.28429..."
230844,54729488,L,01/31/2017,03/04/2017,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$500.00,05/10/2017,$100.00,...,,GR,0130P,$100.00,$400.00,$0.00,Broadway East,Eastern,12.0,"1722 E LAFAYETTE AVE\nBaltimore, MD\n(39.31068..."
230845,54729876,L,01/31/2017,03/04/2017,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,UNREGISTERED OR INOPERABLE VEHICLE ...,$50.00,03/29/2017,$50.00,...,,,,$50.00,$0.00,$0.00,Kenilworth Park,Northern,4.0,"5208 SAINT GEORGES AVE\nBaltimore, MD\n(39.352..."


Or a specific number of rows at the top or bottom by putting that number in the parentheses

Even if it looks like most of our data is self-explanatory, we might want to check the kind of data stored in case it's all in strings, etc.
We can do this by:

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230846 entries, 0 to 230845
Data columns (total 29 columns):
CitationNo                    230846 non-null int64
LienCode                      230846 non-null object
ViolationDate                 230846 non-null object
DueDate                       230846 non-null object
Agency                        230846 non-null object
FineAmount                    230846 non-null object
Description                   230846 non-null object
Balance                       230846 non-null object
LastPaidDate                  109084 non-null object
LastPaidAmount                230846 non-null object
HearingDate                   28247 non-null object
HearingRequestReceivedDate    23328 non-null object
CitationStatus                230846 non-null object
ViolationCodeArticle          230846 non-null object
ViolationCodeSection          230846 non-null object
ViolationLocation             230846 non-null object
Block                         230846 non-nul

We can also get a quick overview of our data by: 

In [9]:
df.describe()

Unnamed: 0,CitationNo,CouncilDistrict
count,230846.0,208252.0
mean,50832410.0,8.180483
std,11186110.0,3.854663
min,1940899.0,1.0
25%,52513210.0,6.0
50%,53839100.0,9.0
75%,54416570.0,12.0
max,54995430.0,14.0


For all numerical value columns in our dataset, or we can look at a particular column's statistics by typing the column's name in quotes in the parentheses. If we look back at our df.info() results, we see that only CitationNo and CouncilDistrict columns are numerical values (floats).

If there are null/NaN (not a number) values in our dataframe, we might want to delete them and their corresponding rows in order to do specific calculations. (Depending on the data and the question, you may want to fill the blank data with a particular value or average of the values--for that there is als a fillna( ) method). As for dropping nulls, we can do this with:

In [10]:
df.dropna()

Unnamed: 0,CitationNo,LienCode,ViolationDate,DueDate,Agency,FineAmount,Description,Balance,LastPaidDate,LastPaidAmount,...,OfficerPresenceRequested,HearingStatus,HearTime,TotalPaid,TotalAbated,TotalVoided,Neighborhood,PoliceDistrict,CouncilDistrict,Location
72,54482294,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$50.00,HIGH GRASS AND WEEDS ...,$0.00,10/06/2016,$15.00,...,N,HR,0130P,$15.00,$50.00,$0.00,Druid Heights,Central,11.0,"1833 DRUID HILL AVE\nBaltimore, MD\n(39.306348..."
107,54480827,L,08/22/2016,09/21/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$900.00,FAILED TO ABATE UNSAFE STRUCTURE NOTICE AND OR...,$0.00,10/06/2016,$215.00,...,N,GR,1100A,$215.00,$700.00,$0.00,Woodbourne-McCabe,Northern,4.0,"0702 MCCABE AVE\nBaltimore, MD\n(39.352396, -7..."
132,54599063,L,12/27/2016,01/28/2017,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$100.00,TRASH ACCUMULATION ...,$100.00,02/02/2017,$20.00,...,,GR,1100A,$20.00,$95.00,$0.00,Charles Village,Northern,12.0,"2633 SAINT PAUL ST\nBaltimore, MD\n(39.320399,..."
150,54484423,L,08/23/2016,09/22/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$250.00,FAILED TO ABATE VIOLATION NOTICE AND ORDER ...,$0.00,10/06/2016,$100.00,...,N,HR,0130P,$100.00,$165.00,$0.00,Coldstream Homestead Montebello,Notheastern,14.0,"2521 GARRETT AVE\nBaltimore, MD\n(39.319214, -..."
286,54486477,L,08/24/2016,09/23/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,WORK WITHOUT OR BEYOND THE SCOPE OF PERMIT ...,$0.00,10/06/2016,$100.00,...,N,HR,0130P,$100.00,$415.00,$0.00,Bolton Hill,Central,11.0,"1311 LINDEN GREEN\nBaltimore, MD\n(39.305094, ..."
393,54632153,L,01/10/2017,02/11/2017,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$100.00,FAILURE TO FILE A COMPLETED ANNUAL REGISTRATIO...,$100.00,03/15/2017,$25.00,...,,GR,0900A,$25.00,$90.00,$0.00,Cherry Hill,Southern,10.0,"0614 HILLVIEW ROAD\nBaltimore, MD\n(39.251083,..."
421,54336466,L,04/22/2016,05/22/2016,BALTIMORE CITY HEALTH DEPARTMENT ...,$500.00,HUMANE CARE REQUIRED - FOOD/SHELTER/VET CARE ...,$0.00,08/24/2016,$50.00,...,N,GR,0900A,$50.00,$450.00,$0.00,Evergreen Lawn,Western,9.0,"2327 CALVERTON HEIGHTS AVE\nBaltimore, MD\n(39..."
605,54236187,N,12/29/2015,01/28/2016,BALTIMORE CITY HEALTH DEPARTMENT ...,$200.00,RESTRAINTS REQUIRED OR RESTRAINTS IMPROPER ...,$0.00,07/26/2016,$25.00,...,N,GR,0130P,$50.00,$150.00,$0.00,Winston-Govans,Northern,4.0,"0520 CHATEAU AVE\nBaltimore, MD\n(39.350728, -..."
709,54901913,L,07/20/2017,08/20/2017,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,FAILED TO OBTAIN LICENSE TO OPERATE A MULTI-FA...,$500.00,08/11/2017,$500.00,...,,HR,0900A,$500.00,$0.00,$0.00,Reservoir Hill,Central,7.0,"0916 NEWINGTON AVE\nBaltimore, MD\n(39.313572,..."
729,54898838,L,07/18/2017,08/19/2017,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,$500.00,WORK WITHOUT OR BEYOND THE SCOPE OF PERMIT ...,$500.00,09/07/2017,$500.00,...,,HR,0130P,$500.00,$15.00,$0.00,Hillen,Notheastern,3.0,"1537 KINGSWAY ROAD\nBaltimore, MD\n(39.344318,..."


If our data isn't in the format that we want (e.g. numbers are stored as a string instead of an integer, etc.), then we can change an entire column's format by:

In [11]:
df.col_name = df.col_name.astype(np.int64) #changing to integer
df.col_name = df.col_name.astype(np.float) #changing to float
df.col_name = df.col_name.astype(np.str) #changing to string

AttributeError: 'DataFrame' object has no attribute 'col_name'

# Data Manipulation

<font color = 'blue'>Column Operations</font>

We can do the same operations that we use with single ints and floats with entire columns. When we do this, we'll probably want to create a new column in our dataframe to hold the values from our new calculations. We do this by:

In [None]:
df['cit_balance']= #column operation that will fill values in new_column

When we perform column operations, we'll use the dataframe name and column name somewhat like a variable in the format: df['existing_column'].  For example, if we want to subtract the TotalPaid column from the FineAmount column and set the values in a new cit_balance in dataframe df:

In [12]:
df.FineAmount = df.FineAmount.str.replace('$', '').replace(',', '').astype(np.float64) 
df.TotalPaid = df.TotalPaid.str.replace('$', '').astype(np.float64)
#in order to perform calculations with these values, we need to convert the string objects (see df.info()) to floats (they 
#have decimals). We define the column in the dataframe to change, remove the $, and change the data type to float

Now if we look at our dataframe's statistics with .describe(), we will see information for all of the columns that we changed to numerical values: 

In [None]:
df.describe()

In [None]:
df.sort_values('FineAmount', ascending = False)

We need to implement the code above before we can perform a new column operation (try to run the code below fist, and you'll get an error message). The above code does the following: 
1. The first df.FineAmount = redefines the column and dataframe that we will edit
2. The second df.FineAmount states which column in the dataframe we are editing
3. .str.replace says that in the current string, we are replacing all of the $ with nothing (' ' are empty quotes)
4. .astype(npfloat64) says that we are redefining the string as a float

We need to remove the $ from our numbers before we redefine the string because characters don't translate to numerical values

In [None]:
df['cit_balance']= df['FineAmount'] - df['TotalPaid']
df

We can do column calculations like this with any operator (e.g. +, -, *, /, etc.), and we can also combine column calculations with regular variable or integer calculations.  For example:

In [None]:
df['new_column']= 2*df['FineAmount']
df['new_column_2'] = (df['cit_balance']/df['FineAmount'])*100 #if we want to know the percentage of fees still owed

We can also calculate information on columns as a whole. For example:

In [None]:
df['your column'].sum() #will give us the sum of the column
df['your column'].mean() #will give us the mean of the column
df['your column'].count() #will give us the number of items in the column
df['your column'].std() #will give us the standard deviation of the col values

In [None]:
#For example, if we want to know the average citation price
df['FineAmount'].mean()

<font color = 'blue'>Split-Apply-Combine</font>

If we're dealing with a particularly large or complicated database (most of them), we'll want to use the Split-Apply-Combine strategy for analyzing our data. This means that we'll break up a huge dataset into smaller, more manageable pieces, operate on each piece individually, and then put it all back together.

__Split__: We want to split up our data into meaningful groups. Usually, our dataset will have some sort of features that we'll want to sort by already.  

For example, if we look at the dataset in our Github page for ECB Citations, we'll see that we have a lot of information about the citations listed. It might be helpful to calculate the total fine amount in our dataset, but it'll probably be more helpful to calculate the fine amounts by different categories such as Agency, Description, or Block.

When we want to split our dataset, we use the pandas function __groupby__, for example: 

In [None]:
new_df = df.groupby('Agency') #will create a new dataframe, new_df, that groups the citations by Agency
new_df = df.groupby('Block') #will create a new dataframe, new_df, that groups the citations by Block

__Apply & Combine__: Next, we'll want to apply calculations on these groups, and combine the results in a new column in our dataframe, etc.

If we simply write: 

In [None]:
df.groupby('Agency').count() 

This will make the Agency name the new dataframe index and fill in the Agency count value for every "cell" in our dataframe, which we don't want!

If we simply want to look at the counts for each agency, then we can create a __Series__, which looks somewhat like a database, but we can't necessarily do calculations or merge this in its current format with a dataframe. In order to get a series, we can write: 

In [None]:
df['Agency'].value_counts()

If we want to create a new column with the Agency counts, then we can do this a few different ways using the transform function (which "transforms" the identified column based on our input, where the output is the same size and shape as the original data): 

In [None]:
df['Agency_Count']= df.groupby('Agency').transform(len) #this will input the "length" the list of each 'Agency' group in a new column, 'Agency_Count'
#NOTE: every row with the same 'Agency' name will have the same 'Agency_Count' number

In [None]:
df['Agency_Count']= df.groupby(['Agency'])['Description'].transform('count') 
#this will count the number of 'Descriptions' listed for each 'Agency' group and input that number in the new column, 'Agency_Count'
#You can use any column identifier here if you are simply counting values

The transform function also allows us to do other calculations in groups such as: 

In [None]:
df['Ag_Fine_Sum']= df.groupby(['Agency'])['FineAmount'].transform('sum') 
#will sum the values in FineAmount for each agency and input that value in the new column, 'Ag_Fine_Sum'

Sometimes it makes more sense to do calculations on each row in a column based on a pre-identified condition. In order to do this we'll transform the data using __lambda__ functions.  Lambda functions run functions on each item in a column based on pre-set conditions. For example, if we want to create a new column that counts the number of citations over $ 500 that an Agency receives: 

In [None]:
df['fines_over_500'] = df.groupby(['Agency'])['FineAmount'].transform(lambda x: (x>500).count())
df.sort_values('FineAmount')


df['fines_over_500'] = df.groupby(['Agency'])['FineAmount'].transform(lambda x: (x>500).count()) means: 
1. df['fines_over_500'] = : defines the new column in our dataframe that we want to make with our data
2. df.groupby(['Agency']) : we want to group our dataframe the Agency type
3. ['FineAmount'] : in each Agency group, we are going to do something with the FineAmount column
4. .transform : we are going to transform our data in the FineAmount column by...
5. (lambda x: (x>500).count()) : using a lambda function! This part of the code tells us that for every row (x) in the column we pre-identified, we will check if that value is greater than 500 (x>500).  If it is, then we'll count it; if it's not, then we won't count it (.count())
6. This will fill in the number of citations issues over $ 500 per Agency, so all rows corresponding to a particular agency will have the same value in this column. This might be helpful if we want to perform column-to-column calculations 

This data grouping is great for helping us identify trends in our data, but if we scroll through the data, we'll see that this citation data occurs over several years. It might be useful to analyze this data for the entire time period provided, but we might also want to look at specific time periods such as years, months, quarters, etc.

We can use another column operation to filter out specific dates, but we''ll first need to convert our "time" data columns to a datetime format (remember that when we looked at the df.info() almost everything was stored as an 'object'). If we want to filter information by the ViolationDate, we'll convert the data by:

In [21]:
pd.to_datetime(df['ViolationDate'])

0        2016-08-10
1        2016-08-17
2        2016-08-17
3        2016-08-19
4        2016-08-18
5        2016-08-19
6        2016-08-18
7        2016-08-18
8        2016-08-18
9        2016-08-19
10       2016-08-18
11       2016-08-18
12       2016-08-08
13       2016-08-19
14       2016-08-19
15       2016-08-19
16       2016-08-18
17       2016-08-17
18       2016-08-16
19       2016-08-19
20       2016-01-05
21       2016-08-16
22       2016-08-18
23       2016-08-16
24       2016-08-18
25       2016-08-18
26       2016-08-19
27       2016-08-17
28       2016-08-19
29       2016-08-19
            ...    
230816   2016-08-16
230817   2016-08-18
230818   2016-08-17
230819   2016-08-16
230820   2016-08-15
230821   2016-08-18
230822   2016-08-18
230823   2016-08-17
230824   2016-08-18
230825   2016-08-18
230826   2016-08-18
230827   2016-08-15
230828   2016-08-15
230829   2016-08-15
230830   2016-08-18
230831   2016-08-19
230832   2016-08-19
230833   2017-01-11
230834   2016-08-19


Then, we can filter the data by ViolationDate. For example, if we want to look at the violation dates in 2013:

In [14]:
df_2013 = df.loc[(df['ViolationDate']> '01/01/2013') & (df['ViolationDate']< '12/31/2013')]
df_2013

Unnamed: 0,CitationNo,LienCode,ViolationDate,DueDate,Agency,FineAmount,Description,Balance,LastPaidDate,LastPaidAmount,...,HearingStatus,HearTime,TotalPaid,TotalAbated,TotalVoided,Neighborhood,PoliceDistrict,CouncilDistrict,Location,fines_over_200
0,54480041,L,08/10/2016,09/09/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,BULK TRASH ...,$0.00,09/27/2016,$50.00,...,,,50.0,$0.00,$0.00,Brooklyn,Southern,10.0,"3420 7TH ST\nBaltimore, MD\n(39.239066, -76.59...",
1,54479266,L,08/17/2016,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,500.0,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$115.00,,$0.00,...,GR,0130P,0.0,$400.00,$0.00,East Baltimore Midway,Eastern,12.0,"1905 SHERWOOD AVE\nBaltimore, MD\n(39.312197, ...",
2,54479357,L,08/17/2016,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,500.0,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$0.00,10/11/2016,$500.00,...,,,500.0,$0.00,$0.00,Broadway East,Eastern,13.0,"1226 N PATTERSON PARK AVE\nBaltimore, MD\n(39....",
3,54479001,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,BULK TRASH ...,$0.00,09/09/2016,$50.00,...,,,50.0,$0.00,$0.00,South Baltimore,Southern,11.0,"1714 MARSHALL ST\nBaltimore, MD\n(39.270047, -...",
4,54478748,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,BULK TRASH ...,$0.00,09/09/2016,$50.00,...,,,50.0,$0.00,$0.00,Arlington,Northwestern,5.0,"5415 NELSON AVE\nBaltimore, MD\n(39.34874, -76...",
5,54479787,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,TRASH ACCUMULATION ...,$0.00,,$0.00,...,,,0.0,$0.00,$50.00,Perring Loch,Notheastern,3.0,"1928 HILLENWOOD ROAD\nBaltimore, MD\n(39.35422...",
6,54479498,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,100.0,FAILURE TO FILE A COMPLETED ANNUAL REGISTRATIO...,$0.00,09/16/2016,$100.00,...,,,100.0,$0.00,$0.00,Bayview,Southeastern,1.0,"0307 DREW ST\nBaltimore, MD\n(39.290297, -76.5...",
7,54479712,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,TRASH ACCUMULATION ...,$0.00,08/30/2016,$50.00,...,,,50.0,$0.00,$0.00,Belair-Edison,Notheastern,13.0,"3023 KENTUCKY AVE\nBaltimore, MD\n(39.323356, ...",
8,54478896,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,HIGH GRASS AND WEEDS ...,$0.00,09/14/2016,$50.00,...,,,50.0,$0.00,$0.00,Belair-Edison,Notheastern,13.0,"2910 EDISON HWY\nBaltimore, MD\n(39.31585, -76...",
9,54479043,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,HIGH GRASS AND WEEDS ...,$0.00,10/07/2016,$50.00,...,,,50.0,$0.00,$0.00,South Baltimore,Southern,11.0,"1712 MARSHALL ST\nBaltimore, MD\n(39.270079, -...",


The __.loc__ allows us to sort through  or identify parts of our dataframe based on specific row labels. This function acts similarly to lambda in that the code "looks" at every row and checks whether or not it complies with our argument, and we can perform similar some of the same functions (see below), but this method can become inefficient when we have larger dataframes. 

When you try to run this code to perform the same task we did with the lambda function: 

In [13]:
df['fines_over_200'] = df.loc[df['FineAmount']>200].count()
df

Unnamed: 0,CitationNo,LienCode,ViolationDate,DueDate,Agency,FineAmount,Description,Balance,LastPaidDate,LastPaidAmount,...,HearingStatus,HearTime,TotalPaid,TotalAbated,TotalVoided,Neighborhood,PoliceDistrict,CouncilDistrict,Location,fines_over_200
0,54480041,L,08/10/2016,09/09/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,BULK TRASH ...,$0.00,09/27/2016,$50.00,...,,,50.0,$0.00,$0.00,Brooklyn,Southern,10.0,"3420 7TH ST\nBaltimore, MD\n(39.239066, -76.59...",
1,54479266,L,08/17/2016,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,500.0,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$115.00,,$0.00,...,GR,0130P,0.0,$400.00,$0.00,East Baltimore Midway,Eastern,12.0,"1905 SHERWOOD AVE\nBaltimore, MD\n(39.312197, ...",
2,54479357,L,08/17/2016,09/16/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,500.0,FAILURE TO FILE ANNUAL VACANT BUILDING REGISTR...,$0.00,10/11/2016,$500.00,...,,,500.0,$0.00,$0.00,Broadway East,Eastern,13.0,"1226 N PATTERSON PARK AVE\nBaltimore, MD\n(39....",
3,54479001,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,BULK TRASH ...,$0.00,09/09/2016,$50.00,...,,,50.0,$0.00,$0.00,South Baltimore,Southern,11.0,"1714 MARSHALL ST\nBaltimore, MD\n(39.270047, -...",
4,54478748,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,BULK TRASH ...,$0.00,09/09/2016,$50.00,...,,,50.0,$0.00,$0.00,Arlington,Northwestern,5.0,"5415 NELSON AVE\nBaltimore, MD\n(39.34874, -76...",
5,54479787,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,TRASH ACCUMULATION ...,$0.00,,$0.00,...,,,0.0,$0.00,$50.00,Perring Loch,Notheastern,3.0,"1928 HILLENWOOD ROAD\nBaltimore, MD\n(39.35422...",
6,54479498,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,100.0,FAILURE TO FILE A COMPLETED ANNUAL REGISTRATIO...,$0.00,09/16/2016,$100.00,...,,,100.0,$0.00,$0.00,Bayview,Southeastern,1.0,"0307 DREW ST\nBaltimore, MD\n(39.290297, -76.5...",
7,54479712,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,TRASH ACCUMULATION ...,$0.00,08/30/2016,$50.00,...,,,50.0,$0.00,$0.00,Belair-Edison,Notheastern,13.0,"3023 KENTUCKY AVE\nBaltimore, MD\n(39.323356, ...",
8,54478896,L,08/18/2016,09/17/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,HIGH GRASS AND WEEDS ...,$0.00,09/14/2016,$50.00,...,,,50.0,$0.00,$0.00,Belair-Edison,Notheastern,13.0,"2910 EDISON HWY\nBaltimore, MD\n(39.31585, -76...",
9,54479043,L,08/19/2016,09/18/2016,DEPARTMENT OF HOUSING & COMMUNITY DEVELOPMENT ...,50.0,HIGH GRASS AND WEEDS ...,$0.00,10/07/2016,$50.00,...,,,50.0,$0.00,$0.00,South Baltimore,Southern,11.0,"1712 MARSHALL ST\nBaltimore, MD\n(39.270079, -...",


Even though our code works and outputs a dataframe, it also gives us a "SettingWithCopyWarning." This warning basically warns us that our code might not have worked as we expected based on how we re-formatted the dataframe, and even if we get the right output now, this might not be the case if we continue to edit our dataframe. 