# Location Data Cleaning

This script cleans the MTA location data and converts it into a form suitable for association rule mining (ARM).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Initially, the subway location data contains a row for every entrance and exit of each subway station. We are only concerned with the routes that serve each station, so only the columns that identify a station and its routes are kept. Duplicate rows are then dropped; they previously differed only in the entrance and exit information. A new variable containing all of the routes for each station must be created. However, there are 11 route columns, and most stations serve far fewer routes than that (42nd St. Times Square has 11), so the resulting NA values must be temporarily filled in to allow the columns to be combined. This placeholder value, "X", is removed in the final, modified data set.

In [41]:
# Read in data and subset to route columns
df = pd.read_csv("../../data/00-raw-data/MTA-Subway-Station-Location-Data.csv",index_col=0)
df = df[['division','line','station_name','route1','route2','route3','route4','route5','route6','route7','route8','route9','route10','route11']]
df = df.drop_duplicates()
# Replace NA route values and convert to str
df = df.fillna("X")
df = df.astype(str)
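
The deduplication and placeholder steps above can be illustrated on a toy table. This is a minimal sketch, not the real data: the station names, the `entrance_type` column, and the two route columns are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the raw data: two entrance rows for one
# station and a single row for another
toy = pd.DataFrame({
    "station_name": ["A St", "A St", "B St"],
    "entrance_type": ["Stair", "Elevator", "Stair"],
    "route1": ["A", "A", "1"],
    "route2": ["C", "C", None],
})

# Keep only the station/route columns, then collapse rows that differed
# only by entrance information
toy_routes = toy[["station_name", "route1", "route2"]].drop_duplicates()

# Fill empty route slots with the placeholder and cast everything to str
toy_routes = toy_routes.fillna("X").astype(str)
print(toy_routes)
```

After these steps each station appears once, and every route slot holds a string, which is what the later concatenation relies on.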

Subsequently, the columns containing the route info are concatenated to create a "basket" of routes for each station (the variable 'allroutes'). The placeholder X's and the commas that precede them are then removed.

In [50]:
# Join the 11 route columns into one comma-separated "basket" per station
route_cols = [f"route{i}" for i in range(1, 12)]
df['allroutes'] = df[route_cols].agg(",".join, axis=1)
# Strip the placeholder X's along with their leading commas (literal match, not regex)
df['allroutes'] = df['allroutes'].str.replace(",X", "", regex=False)
# Save dataframe in csv
df.to_csv("../../data/01-modified-data/Location-Data-Cleaned.csv")
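
As a possible alternative to the placeholder, the basket can be built by joining only the non-missing values in each row. The sketch below compares both approaches on a hypothetical three-column table; the toy data and the variable names `with_placeholder` and `without_placeholder` are assumptions for illustration, not part of the original script.

```python
import pandas as pd

# Hypothetical route table with missing slots
toy = pd.DataFrame({
    "route1": ["A", "1", "N"],
    "route2": ["C", None, "Q"],
    "route3": [None, None, "R"],
})
route_cols = ["route1", "route2", "route3"]

# Placeholder approach, mirroring the cell above
filled = toy[route_cols].fillna("X").astype(str)
with_placeholder = (
    filled.agg(",".join, axis=1).str.replace(",X", "", regex=False)
)

# Placeholder-free alternative: drop missing values row by row, then join
without_placeholder = toy[route_cols].apply(
    lambda row: ",".join(row.dropna().astype(str)), axis=1
)

print(with_placeholder.tolist())
print(without_placeholder.tolist())
```

Both approaches should give the same baskets here. The second avoids any risk of the placeholder colliding with a real route label, at the cost of a row-wise `apply`.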

In [52]:
import nltk
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.downloader import download

import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from apyori import apriori
import networkx as nx 

download('vader_lexicon')
download('stopwords')
download('wordnet')
download('punkt')
download('omw-1.4')


In [None]:
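# Assumption: the empty column key was meant to be 'allroutes'.
# Split each basket string into a list of routes, one list per station.
transactions = df['allroutes'].str.split(",").tolist()

In [None]:
# Note: apyori appears to read only min_support, min_confidence, min_lift and max_length,
# so min_length may be silently ignored.
results = list(apriori(transactions, min_support=0.005, min_confidence=0.004, min_length=1, max_length=9))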
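
To make the `min_support` and `min_confidence` thresholds concrete, the sketch below computes both measures by hand on a few toy baskets. It uses the standard definitions of support and confidence and is not apyori's implementation; the baskets and the helper names `support` and `confidence` are hypothetical.

```python
# Toy baskets standing in for the per-station route lists (hypothetical data)
baskets = [["A", "C"], ["A", "C", "E"], ["1"], ["A"]]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(b) for b in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# {A, C} appears in 2 of the 4 baskets
print(support({"A", "C"}, baskets))
# Of the 3 baskets containing A, 2 also contain C
print(confidence({"A"}, {"C"}, baskets))
```

Under these definitions, `min_support=0.005` keeps route combinations that occur at no fewer than 0.5% of stations, and `min_confidence=0.004` is a very permissive filter on the rules derived from them.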