# Pandas

**Staff**: Wouter Haverals

**Support material**:
- Class [notebook](https://github.com/dtaantwerp/dtaantwerp.github.io/blob/DTA_Bootcamp_2021_students/notebooks/14_W3_Wed_Pandas.ipynb)
- [Pandas user documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)
- Informative [YouTube playlist](https://www.youtube.com/watch?v=yzIMircGU5I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&ab_channel=DataSchool) by dataschool

<h2 style="color:purple">Datasets</h2>
<a style="font-size:120%;color:blue" href="https://raw.githubusercontent.com/dtaantwerp/dtaantwerp.github.io/master/data/311-service-requests.csv">311-service-requests.csv</a> (a 311 call gives access to non-emergency city services and information about city government programs; this file lists all such requests for New York City)

#### Start by importing pandas and the file we will work with

In [None]:
import pandas as pd

#### Exercise 1: Create a DataFrame object for the file

In [None]:
df = pd.read_csv('311-service-requests.csv') # https://pandas.pydata.org/docs/reference/io.html

In [None]:
# We get a warning... But what is the matter?
#
# Pandas is really nice: it guesses which dtype each column has!
# By default (low_memory=True), Pandas parses the CSV file in chunks
# and guesses the dtype of each chunk separately. If the guesses differ between chunks,
# the column ends up with mixed dtypes and Pandas raises a DtypeWarning
# to let you know it was confused and something might have gone wrong.
# One way of dealing with this warning is to add the parameter 'low_memory=False':
# Pandas then reads the whole column before deciding on a single dtype.
# Another way is to state the dtype yourself with the 'dtype' parameter.
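
The dtype guessing described above can be sidestepped by declaring a column's dtype explicitly. The following is a minimal sketch, not the notebook's own solution: the inline CSV and its column names (`id`, `zip`) are invented for illustration and only mimic a column that mixes numeric-looking and text values.

```python
import io

import pandas as pd

# Hypothetical mini-CSV: the 'zip' column mixes numeric-looking and text values.
csv_text = "id,zip\n1,10001\n2,unknown\n3,11215\n"

# Declaring the dtype up front means pandas does not have to guess it.
toy = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

# Every value in 'zip' is a string, including the numeric-looking ones.
print(toy["zip"].tolist())
```

For the real file, `pd.read_csv('311-service-requests.csv', low_memory=False)` is the simpler route when you do not know in advance which columns are affected.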

#### Exercise 2: Explore the DataFrame
Use Pandas methods to 
- Get the first 3 rows of the df
- Get the last 3 rows of the df
- Get general info about the df
- Get the number of rows and columns in the df
- Investigate what data types the df contains
- Find out what columns the df consists of

In [None]:
# first 3 rows of df:
df.head(3)

In [None]:
# last 3 rows of df:
df.tail(3) # another, more Pythonic way of doing this: df[-3:]

In [None]:
# general info about the df:

df.info()

In [None]:
# number of rows and columns in the df:
df.shape # the shape property returns a tuple (number of rows, number of columns)

In [None]:
# investigate the types:

df.dtypes

In [None]:
# investigate the columns:

df.columns

#### Exercise 3: Use apply() and lambda to lowercase all agency names

In [None]:
# your code here:

df['Agency Name'] = df['Agency Name'].apply(lambda x: x.lower())

# with the apply() method, you can apply a function to a single column, to all columns or to a list of columns
# i.e., on a single column, apply() calls the supplied function once for each value in that column
# a lambda function is an anonymous (unnamed) function written on a single line; it can take any number of arguments,
# but its body can only consist of one expression
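
As a small illustration of the comments above, the sketch below runs `apply()` with a lambda on a toy Series and compares it with the vectorised `.str.lower()` accessor. The agency names are made up and merely stand in for the 'Agency Name' column.

```python
import pandas as pd

# Toy stand-in for the 'Agency Name' column (values invented for illustration).
names = pd.Series(["Department of Health", "NEW YORK CITY Police Department"])

# apply() calls the lambda once per value in the Series.
lowered_apply = names.apply(lambda x: x.lower())

# The .str accessor performs the same operation without an explicit function.
lowered_str = names.str.lower()

print(lowered_apply.tolist())
print(lowered_str.tolist())
```

Both calls give the same lowercased values here. On columns with missing values, `.str.lower()` is usually the safer choice, because the lambda would fail on a NaN (it has no `lower()` method).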

In [None]:
## another example, with a dummy, self-defined function:

def abbreviate_to_nyc(x): 
    if 'new york city' in x:
        x = x.replace('new york city', 'NYC')
    return x

df['Agency Name'] = df['Agency Name'].apply(abbreviate_to_nyc) # pass the function itself, without parentheses
df.head()

#### Exercise 4: Column selection
- Slice columns "Agency" and "Agency Name"
- Use .loc to select the columns "Agency" and "Agency Name"

In [None]:
# slice columns. your code here:

df[["Agency", "Agency Name"]]

In [None]:
# select columns. your code here:

df.loc[:, ["Agency", "Agency Name"]]
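
The two selections above return the same DataFrame. A minimal sketch on a toy frame (the values and the extra `Borough` column are only illustrative) shows this, and also shows that selecting a single label returns a Series instead:

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "Agency": ["NYPD", "DOT"],
    "Agency Name": ["police", "transport"],
    "Borough": ["BRONX", "QUEENS"],
})

one_col = toy["Agency"]                            # single label -> Series
two_cols = toy[["Agency", "Agency Name"]]          # list of labels -> DataFrame
same_cols = toy.loc[:, ["Agency", "Agency Name"]]  # .loc[rows, columns]

print(type(one_col).__name__, type(two_cols).__name__)
print(two_cols.equals(same_cols))
```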

#### Exercise 5: Row selection
- Slice rows 0 to 10
- Use .iloc to select rows 0 to 10
- Use .iloc to select the value in row 10, column 5
- Use boolean indexing to select rows for which "Agency" equals "NYPD"

In [None]:
# slice rows 0 to 10 (the end of the slice, row 10, is not included):

df[0:10]

In [None]:
# select rows 0 to 10 with .iloc (again, row 10 is not included):

df.iloc[0:10]

In [None]:
# select the value at row 10, column 5 (positions are zero-based):

df.iloc[10, 5]

In [None]:
# select rows where 'Agency' equals 'NYPD':

df[df["Agency"] == "NYPD"]
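
The row selections in this exercise differ in how they treat the end of a slice. The sketch below uses a toy frame with invented values and a default integer index; it also includes `.loc`, which the exercise does not ask for, purely for contrast.

```python
import pandas as pd

# Toy frame with a default integer index 0..4 (values invented).
toy = pd.DataFrame({"Agency": ["NYPD", "DOT", "NYPD", "HPD", "DOT"]})

print(len(toy[0:3]))       # positional slice: end excluded
print(len(toy.iloc[0:3]))  # positional slice: end excluded
print(len(toy.loc[0:3]))   # label slice: end included

# Boolean indexing: the comparison yields a True/False Series (a mask),
# and indexing with that mask keeps only the rows marked True.
mask = toy["Agency"] == "NYPD"
print(mask.tolist())
print(toy[mask].index.tolist())
```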

#### Exercise 6: Count the frequency of each complaint type using "value_counts()"

In [None]:
# code here:

frequencies = df['Complaint Type'].value_counts() # In Pandas terms, what does this method return?
frequencies
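
To answer the question in the comment above on a small scale: `value_counts()` returns a Series whose index holds the unique values and whose values are the counts, sorted from most to least frequent. A minimal sketch with invented complaint types:

```python
import pandas as pd

# Toy stand-in for the 'Complaint Type' column (values invented).
complaints = pd.Series(["Noise", "Heating", "Noise", "Rodent", "Noise", "Heating"])
counts = complaints.value_counts()

print(type(counts).__name__)   # the result is a Series
print(counts.index.tolist())   # unique values, most frequent first
print(counts.tolist())         # the corresponding counts
```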

#### Exercise 7: Plot a horizontal bar chart that displays the top 20 most frequent complaint types

In [None]:
import matplotlib.pyplot as plt

In [None]:
# code here:

frequencies[:20].plot(kind='barh') # bar, pie, line, etc...

#### Exercise 8: Find the following statistics of the Series created in exercise 6
- the mean of the frequencies
- the standard deviation of the frequencies
- the highest frequency
- the lowest frequency
- the sum of the frequencies

In [None]:
# code here:

print(frequencies.mean())
print(frequencies.std()) # standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean
print(frequencies.max())
print(frequencies.min())
print(frequencies.sum())
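
One detail worth knowing about the statistics above: `Series.std()` computes the sample standard deviation (`ddof=1`) by default. The sketch below uses a small made-up Series to contrast it with the population standard deviation (`ddof=0`), and shows `describe()` as a shortcut for several of these statistics at once.

```python
import pandas as pd

# Made-up numbers, chosen so that the population standard deviation is exactly 2.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(values.mean())        # arithmetic mean
print(values.std(ddof=0))   # population standard deviation
print(values.std())         # sample standard deviation (default, ddof=1)
print(values.describe())    # count, mean, std, min, quartiles and max in one call
```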

#### Exercise 9: Take a random sample of 100 rows of the complaints DataFrame and save it to a file

In [None]:
# code here:
sample = df.sample(100, random_state=2022)
sample.to_csv('df.csv')
sample
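
Two aspects of the solution above can be checked on a toy frame: `random_state` makes the sample reproducible, and `to_csv()` writes the index as an extra column unless `index=False` is passed. This is only a sketch; the frame, the temporary directory and the file name `sample.csv` are illustrative choices.

```python
import os
import tempfile

import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({"id": range(10), "value": [i * i for i in range(10)]})

# The same random_state yields the same rows on every call.
first = toy.sample(3, random_state=2022)
second = toy.sample(3, random_state=2022)
print(first.equals(second))

# Write to a temporary file without the index column, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    first.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored.columns.tolist())
```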

#### Exercise 10: Duplicates
- Remove all duplicate rows from the complaints data frame (keep first)
- Remove all rows with duplicate agencies (keep last)

"inplace" argument:
- What does it do?
- Which data type?
- Can you drop duplicate rows "inplace" without using the "inplace" argument?

In [None]:
# remove duplicate rows:

deduplicated = df.drop_duplicates() # returns a new DataFrame; df itself is left unchanged
deduplicated.shape # same shape as df: there are no duplicates in this df...

In [None]:
people = [['Jack', 20, 'student'],
          ['Joe', 50, 'engineer'],
          ['Lisa', 4, 'just a kid'],
          ['Jack', 20, 'student']] # oops, there is a duplicate in our data, but we didn't notice...

df_people = pd.DataFrame(people) # data to Pandas DataFrame
df_people

In [None]:
# see what happens here!
df_people.drop_duplicates()

In [None]:
# remove rows with duplicate agencies (keep last):

df.drop_duplicates('Agency', keep='last')
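
The questions about the "inplace" argument can be explored with the small people example. The sketch below is one possible way to look at it, not the only answer; the column names (`name`, `age`, `role`) are added here for readability and are not part of the original data.

```python
import pandas as pd

people = pd.DataFrame(
    [["Jack", 20, "student"],
     ["Joe", 50, "engineer"],
     ["Lisa", 4, "just a kid"],
     ["Jack", 20, "student"]],
    columns=["name", "age", "role"],  # hypothetical column names
)

# Without inplace, the result is a new DataFrame; 'people' keeps all 4 rows.
people.drop_duplicates()
print(len(people))

# Reassigning the result is the usual alternative to inplace=True.
deduplicated = people.drop_duplicates()
print(len(deduplicated))

# Duplicates judged on one column only, keeping the last occurrence.
last_kept = people.drop_duplicates("name", keep="last")
print(last_kept.index.tolist())

# inplace is a boolean; with inplace=True the frame is modified and None is returned.
result = people.drop_duplicates(inplace=True)
print(result, len(people))
```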