# Understanding Pandas Series and DataFrames - Lab

# Introduction

In this lab, let's get some hands-on practice with data cleanup using Pandas.

## Objectives
You will be able to:

* Manipulate columns in DataFrames (df.rename, df.drop)
* Manipulate the index in DataFrames (df.reindex, df.drop, df.rename)
* Manipulate column datatypes

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('turnstile_180901.txt')
df.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,00:00:00,REGULAR,6736067,2283184
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,04:00:00,REGULAR,6736087,2283188
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,08:00:00,REGULAR,6736105,2283229
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,12:00:00,REGULAR,6736180,2283314
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,16:00:00,REGULAR,6736349,2283384


# Practice

## Objectives
You will be able to:
* Understand and explain what Pandas Series and DataFrames are and how they differ from dictionaries and lists
* Create Series & DataFrames from dictionaries and lists
* Manipulate columns in DataFrames (df.rename, df.drop)
* Manipulate the index in DataFrames (df.reindex, df.drop, df.rename)
* Manipulate column datatypes
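
As a quick refresher on the first two objectives, here is a minimal sketch of building a Series and a DataFrame from plain lists and dictionaries. The variable names (`entries`, `by_time`, `toy`, `toy_rows`) are made up for illustration, and the values are borrowed from the sample rows above.

```python
import pandas as pd

# A Series from a list gets a default integer index (0, 1, 2, ...).
entries = pd.Series([6736067, 6736087, 6736105])

# A Series from a dictionary uses the keys as index labels.
by_time = pd.Series({"00:00:00": 6736067, "04:00:00": 6736087})

# A DataFrame from a dictionary of lists: keys become column names.
toy = pd.DataFrame({
    "station": ["59 ST", "59 ST"],
    "entries": [6736067, 6736087],
})

# A DataFrame from a list of dictionaries: each dictionary is one row.
toy_rows = pd.DataFrame([
    {"station": "59 ST", "entries": 6736067},
    {"station": "59 ST", "entries": 6736087},
])

print(by_time["04:00:00"])  # label-based lookup, much like a dictionary
print(toy)
print(toy_rows)
```

Unlike a plain list or dictionary, both objects carry an index and a dtype, which is what makes the column and index operations in the rest of this lab possible.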

# Rename all the columns to lower case

In [3]:
#Your code here
# rename accepts a function; str.lower is applied to every column label
df = df.rename(columns=str.lower)
df.head()

Unnamed: 0,c/a,unit,scp,station,linename,division,date,time,desc,entries,exits
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,00:00:00,REGULAR,6736067,2283184
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,04:00:00,REGULAR,6736087,2283188
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,08:00:00,REGULAR,6736105,2283229
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,12:00:00,REGULAR,6736180,2283314
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,16:00:00,REGULAR,6736349,2283384


# Change the Index to be the Line Names

In [4]:
#Your code here
df = df.set_index('linename')
df.head()

linename,c/a,unit,scp,station,division,date,time,desc,entries,exits
NQR456W,A002,R051,02-00-00,59 ST,BMT,08/25/2018,00:00:00,REGULAR,6736067,2283184
NQR456W,A002,R051,02-00-00,59 ST,BMT,08/25/2018,04:00:00,REGULAR,6736087,2283188
NQR456W,A002,R051,02-00-00,59 ST,BMT,08/25/2018,08:00:00,REGULAR,6736105,2283229
NQR456W,A002,R051,02-00-00,59 ST,BMT,08/25/2018,12:00:00,REGULAR,6736180,2283314
NQR456W,A002,R051,02-00-00,59 ST,BMT,08/25/2018,16:00:00,REGULAR,6736349,2283384


# Painstakingly change the index back

In [5]:
# Your code here
df = df.reset_index()
df.head()

Unnamed: 0,linename,c/a,unit,scp,station,division,date,time,desc,entries,exits
0,NQR456W,A002,R051,02-00-00,59 ST,BMT,08/25/2018,00:00:00,REGULAR,6736067,2283184
1,NQR456W,A002,R051,02-00-00,59 ST,BMT,08/25/2018,04:00:00,REGULAR,6736087,2283188
2,NQR456W,A002,R051,02-00-00,59 ST,BMT,08/25/2018,08:00:00,REGULAR,6736105,2283229
3,NQR456W,A002,R051,02-00-00,59 ST,BMT,08/25/2018,12:00:00,REGULAR,6736180,2283314
4,NQR456W,A002,R051,02-00-00,59 ST,BMT,08/25/2018,16:00:00,REGULAR,6736349,2283384
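
The round trip between `set_index` and `reset_index` can be sketched on a small stand-in frame. The `toy` frame and its values are hypothetical; only the column names come from the lab data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the turnstile data.
toy = pd.DataFrame({
    "linename": ["NQR456W", "NQR456W", "F"],
    "station": ["59 ST", "59 ST", "RIT-ROOSEVELT"],
    "entries": [6736067, 6736087, 100],
})

indexed = toy.set_index("linename")  # 'linename' moves from a column to the index
restored = indexed.reset_index()     # the index moves back in as the first column

# While 'linename' is the index, rows can be selected by label.
f_rows = indexed.loc[["F"]]

# reset_index(drop=True) discards the old index instead of restoring it.
dropped = indexed.reset_index(drop=True)

print(list(restored.columns))
print(list(dropped.columns))
```

Note that `reset_index` always places the restored column first, which is why `linename` now appears before `c/a` in the output above rather than in its original position.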


## Create another column 'Num_Lines' that is a count of how many lines pass through a station. Then sort your dataframe by this column in descending order.

In [6]:
df.loc[0:5,df['station']]

KeyError: 'None of [0                 59 ST\n1                 59 ST\n              ...      \n197624    RIT-ROOSEVELT\nName: station, Length: 197625, dtype: object] are in the [columns]'

In [None]:
df.loc[df['station']=='CITY / BUS', 'num_lines']
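
The `KeyError` above arises because the second argument to `.loc` is read as column labels, so passing `df['station']` asks for columns named `59 ST`, `RIT-ROOSEVELT`, and so on. Below is a minimal sketch of one way to build the new column, under the assumption (not stated in the lab) that each character of `linename` stands for one line. The `toy` frame is hypothetical, and the column is named `num_lines` here to match the lower-cased columns.

```python
import pandas as pd

# Hypothetical toy frame; 'HYPOTHETICAL ST' is a made-up station.
toy = pd.DataFrame({
    "linename": ["NQR456W", "F", "ACE"],
    "station": ["59 ST", "RIT-ROOSEVELT", "HYPOTHETICAL ST"],
})

# Rows 0 through 1 of a single column: pass the column label, not the column itself.
first_stations = toy.loc[0:1, "station"]

# Assumption: one character per line, so the string length is the line count.
toy["num_lines"] = toy["linename"].str.len()

# Sort with the largest count first.
toy = toy.sort_values(by="num_lines", ascending=False)
print(toy)
```

If a station is served by several turnstile banks with different `linename` values, this per-row count would need to be combined per station, which the sketch does not attempt.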

## Write a function to clean a column name.

In [6]:
def clean(col_name):
    # Remove leading and trailing whitespace from the column name
    cleaned = col_name.strip()
    return cleaned

In [7]:
#This is a list comprehension. It applies your clean function to every item in the list.
#We then reassign that to df.columns
#You shouldn't have to change anything here.
#Your function above should work appropriately here.
df.columns = [clean(col) for col in df.columns] 

In [8]:
#Checking the output, we can see the results.
df.columns

Index(['linename', 'c/a', 'unit', 'scp', 'station', 'division', 'date', 'time',
       'desc', 'entries', 'exits'],
      dtype='object')
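
To see why stripping matters, here is a small sketch with a hypothetical frame whose headers carry stray spaces. The padded names are invented for illustration and are not taken from the lab file.

```python
import pandas as pd

def clean(col_name):
    """Strip leading and trailing whitespace from a column name."""
    return col_name.strip()

# Hypothetical frame whose headers carry stray spaces.
toy = pd.DataFrame({" entries": [1, 2], "exits   ": [3, 4]})

before = "exits" in toy.columns  # the padded name does not match 'exits'
toy.columns = [clean(col) for col in toy.columns]
after = "exits" in toy.columns   # the cleaned name does

print(before, after)
print(toy["exits"].sum())
```

A padded header is easy to miss in a rendered table, so cleaning names once up front avoids puzzling `KeyError`s later.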

## Compare subway traffic by day of the week. Display this as a graph.

In [None]:
#Your code here
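
One possible approach, sketched on a hypothetical toy frame: convert `date` to a datetime, derive the day name, total the traffic per day, and plot the result. The sketch treats the toy numbers as per-row counts. The `entries` and `exits` values in the real file look like running counters, so they would likely need to be differenced per turnstile before summing, which is not shown here. The names `toy`, `day_of_week`, and `by_day` are assumptions.

```python
import pandas as pd

# Hypothetical toy rows; 08/25/2018 was a Saturday and 08/27/2018 a Monday.
toy = pd.DataFrame({
    "date": ["08/25/2018", "08/25/2018", "08/26/2018", "08/27/2018"],
    "entries": [10, 20, 5, 40],
    "exits": [8, 12, 4, 30],
})

# Manipulate the column datatype: string -> datetime.
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y")
toy["day_of_week"] = toy["date"].dt.day_name()

# Total traffic for each day of the week.
by_day = toy.groupby("day_of_week")[["entries", "exits"]].sum()

# Put the days in calendar order rather than alphabetical order.
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
by_day = by_day.reindex([day for day in order if day in by_day.index])

# Draw a grouped bar chart (requires matplotlib, imported at the top of the lab).
ax = by_day.plot(kind="bar", title="Traffic by day of week (toy data)")
ax.set_ylabel("count")
print(by_day)
```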

## Is there more subway traffic on a weekend or a weekday? Be specific in comparing magnitudes.

In [None]:
#Your code here
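
A sketch of one way to make the comparison concrete, again on hypothetical toy numbers treated as per-row counts. Because a week has five weekdays and two weekend days, the sketch compares average daily totals rather than raw sums. The `day_type` column and the `daily`, `avg`, and `ratio` names are assumptions.

```python
import pandas as pd

# Hypothetical toy rows: a Saturday, a Sunday, a Monday and a Tuesday.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["08/25/2018", "08/26/2018", "08/27/2018", "08/28/2018"],
        format="%m/%d/%Y",
    ),
    "entries": [30, 10, 40, 60],
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
is_weekend = toy["date"].dt.dayofweek >= 5
toy["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Total per calendar day first, then average those daily totals per group.
daily = toy.groupby(["date", "day_type"])["entries"].sum().reset_index()
avg = daily.groupby("day_type")["entries"].mean()

ratio = avg["weekday"] / avg["weekend"]
print(avg)
print(f"Average weekday traffic is {ratio:.2f}x average weekend traffic (toy data)")
```

Reporting the ratio alongside the two averages gives the magnitude comparison the question asks for.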

# Drop a couple of columns

In [9]:
# Your code here
# Drop both columns in one call using df.drop
df = df.drop(columns=['c/a', 'desc'])