# Parsing location information from tweets

_Author: Bala Krishnamoorthy_

### Overview

Once our NLP model has classified which tweets are "useful" (i.e. contain relevant, identifiable information on road blockages) and which are not, the location information needs to be extracted from each useful tweet.

The **goal** is to find the relevant start and end points of each road closure. This notebook uses regex to identify these start and end points by searching for intersections mentioned within tweets.

Throughout this notebook, you will find my comments in _markdown_ and `#comment` format.

### Contents
- [Data Cleaning](#Data-Cleaning)
- [Data Processing with Regex](#Data-Processing-with-Regex)
- [Final Output](#Final-Output)
- [Future Updates](#Future-Updates)

### Data Cleaning

In [18]:
# Import Libraries
import pandas as pd
import re
from pathlib import Path

In [19]:
# Read in the raw tweets and the classified tweets (i.e. the dataset produced by the
# classification model).
raw_tweets_df = pd.read_csv('../../data/2-interim/tweets_with_intersection.csv')
classes_tweets_df = pd.read_csv('../../data/2-interim/twitter_corpus.csv', sep='\t', 
                                encoding='latin-1')

In [20]:
classes_tweets_df.head()

Unnamed: 0,tweed_id,rating,tweet
0,0,1,"Road construction, left lane closed in #Albuqu..."
1,1,1,Road construction. right lanes closed in #Pima...
2,2,1,"Road construction, shoulder closed in #ElPaso ..."
3,3,0,Ughhh at the dentist for a cleaning and the si...
4,4,1,Road constructions. two right lanes closed in ...


In [21]:
raw_tweets_df.head()

Unnamed: 0.1,Unnamed: 0,text,intersection,coordinates.coordinates,created_at,place.bounding_box.coordinates,place.country,place.country_code,place.full_name,user.created_at,user.location
0,0,"Road construction, left lane closed in #Albuqu...",Tijeras Ave E,"[-106.65, 35.08653]",Mon Jan 14 14:56:55 +0000 2019,"[[[-106.7916912, 35.0158912], [-106.473745, 35...",United States,US,"Albuquerque, NM",Tue Feb 08 23:44:59 +0000 2011,Albuquerque
1,1,Road construction. right lanes closed in #Pima...,I-10 EB at Ruthrauff Rd,"[-111.0295, 32.29418]",Mon Jan 14 14:31:30 +0000 2019,"[[[-111.083219, 32.057802], [-110.747928, 32.0...",United States,US,"Tucson, AZ",Thu Feb 10 00:00:33 +0000 2011,"Tucson, AZ"
2,2,"Road construction, shoulder closed in #ElPaso ...",I 10 Both EB/WB from Executive Ctr Blvd to Sun...,"[-106.52, 31.79295]",Mon Jan 14 14:28:12 +0000 2019,"[[[-106.634874, 31.6206683], [-106.199987, 31....",United States,US,"El Paso, TX",Thu Feb 10 20:42:54 +0000 2011,"El Paso, TX"
3,3,Ughhh at the dentist for a cleaning and the si...,,,Mon Jan 14 14:08:05 +0000 2019,"[[[-86.044756, 41.426657], [-85.9629436, 41.42...",United States,US,"Nappanee, IN",Thu Jul 30 19:27:12 +0000 2009,"New Paris, IN"
4,4,Road constructions. two right lanes closed in ...,I-10 EB at Ruthrauff Rd,"[-111.0295, 32.29418]",Mon Jan 14 13:49:35 +0000 2019,"[[[-111.083219, 32.057802], [-110.747928, 32.0...",United States,US,"Tucson, AZ",Thu Feb 10 00:00:33 +0000 2011,"Tucson, AZ"


In [22]:
# Drop unneeded columns (columns= already implies axis=1)
classes_tweets_df.drop(columns='tweed_id', inplace=True)
raw_tweets_df.drop(columns='Unnamed: 0', inplace=True)

In [23]:
classes_tweets_df.head()

Unnamed: 0,rating,tweet
0,1,"Road construction, left lane closed in #Albuqu..."
1,1,Road construction. right lanes closed in #Pima...
2,1,"Road construction, shoulder closed in #ElPaso ..."
3,0,Ughhh at the dentist for a cleaning and the si...
4,1,Road constructions. two right lanes closed in ...


In [24]:
raw_tweets_df.head()

Unnamed: 0,text,intersection,coordinates.coordinates,created_at,place.bounding_box.coordinates,place.country,place.country_code,place.full_name,user.created_at,user.location
0,"Road construction, left lane closed in #Albuqu...",Tijeras Ave E,"[-106.65, 35.08653]",Mon Jan 14 14:56:55 +0000 2019,"[[[-106.7916912, 35.0158912], [-106.473745, 35...",United States,US,"Albuquerque, NM",Tue Feb 08 23:44:59 +0000 2011,Albuquerque
1,Road construction. right lanes closed in #Pima...,I-10 EB at Ruthrauff Rd,"[-111.0295, 32.29418]",Mon Jan 14 14:31:30 +0000 2019,"[[[-111.083219, 32.057802], [-110.747928, 32.0...",United States,US,"Tucson, AZ",Thu Feb 10 00:00:33 +0000 2011,"Tucson, AZ"
2,"Road construction, shoulder closed in #ElPaso ...",I 10 Both EB/WB from Executive Ctr Blvd to Sun...,"[-106.52, 31.79295]",Mon Jan 14 14:28:12 +0000 2019,"[[[-106.634874, 31.6206683], [-106.199987, 31....",United States,US,"El Paso, TX",Thu Feb 10 20:42:54 +0000 2011,"El Paso, TX"
3,Ughhh at the dentist for a cleaning and the si...,,,Mon Jan 14 14:08:05 +0000 2019,"[[[-86.044756, 41.426657], [-85.9629436, 41.42...",United States,US,"Nappanee, IN",Thu Jul 30 19:27:12 +0000 2009,"New Paris, IN"
4,Road constructions. two right lanes closed in ...,I-10 EB at Ruthrauff Rd,"[-111.0295, 32.29418]",Mon Jan 14 13:49:35 +0000 2019,"[[[-111.083219, 32.057802], [-110.747928, 32.0...",United States,US,"Tucson, AZ",Thu Feb 10 00:00:33 +0000 2011,"Tucson, AZ"


In [25]:
# Number of raw tweets by location 
raw_tweets_df['place.full_name'].value_counts()[:5]

Fort Worth, TX     13
Albuquerque, NM    11
San Antonio, TX     7
Phoenix, AZ         6
El Paso, TX         5
Name: place.full_name, dtype: int64

In [27]:
# Tweets rated "2" indicate a road closure. I will only be examining (fully) closed roads in 
# this notebook.
final_tweets_df = classes_tweets_df[classes_tweets_df['rating'] == 2]
final_tweets_df.head()

Unnamed: 0,rating,tweet
5,2,All eastbound lanes are closed due to snow and...
10,2,The drainage project on Center Street in #Vine...
11,2,Closed due to road construction in #FortWorth ...
12,2,Closed due to road construction in #FortWorth ...
15,2,Closed due to road construction in #Evergreen ...


In [29]:
# Combine with original df (join by index) to pull city info
final_tweets_df = pd.merge(final_tweets_df, raw_tweets_df, left_index=True, 
                           right_index=True)
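
Joining on the index assumes that row *i* of the classified file and row *i* of the raw file describe the same tweet. A minimal sketch of a check for that assumption follows; the helper name `index_join_is_aligned` and the two small stand-in frames are hypothetical and not part of the notebook's data.

```python
import pandas as pd

def index_join_is_aligned(classified, raw):
    """Return True if every index-joined row carries the same tweet text on both sides."""
    merged = pd.merge(classified, raw, left_index=True, right_index=True)
    return bool((merged['tweet'] == merged['text']).all())

# Hypothetical stand-ins for classes_tweets_df and raw_tweets_df
classified = pd.DataFrame({'rating': [1, 2], 'tweet': ['lane closed', 'road closed']})
raw = pd.DataFrame({'text': ['lane closed', 'road closed'],
                    'place.full_name': ['Tucson, AZ', 'Houston, TX']})
# Same rows in reverse order, to simulate files that are out of step
shuffled = raw.iloc[::-1].reset_index(drop=True)

print(index_join_is_aligned(classified, raw))       # True
print(index_join_is_aligned(classified, shuffled))  # False
```

On the real files an exact string comparison may be too strict, since the two CSVs are read with different encodings; comparing a normalized prefix of each tweet would be one way to relax it.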

In [30]:
final_tweets_df = final_tweets_df[['tweet','created_at',
                                   'place.bounding_box.coordinates','place.full_name']]
final_tweets_df.reset_index(drop=True, inplace=True)

In [31]:
final_tweets_df.head()

Unnamed: 0,tweet,created_at,place.bounding_box.coordinates,place.full_name
0,All eastbound lanes are closed due to snow and...,Mon Jan 14 13:13:00 +0000 2019,"[[[-124.482003, 32.528832], [-114.131212, 32.5...","California, USA"
1,The drainage project on Center Street in #Vine...,Mon Jan 14 11:23:01 +0000 2019,"[[[-70.62084, 41.4245617], [-70.586828, 41.424...","Vineyard Haven, MA"
2,Closed due to road construction in #FortWorth ...,Mon Jan 14 10:33:20 +0000 2019,"[[[-97.538285, 32.569477], [-97.033542, 32.569...","Fort Worth, TX"
3,Closed due to road construction in #FortWorth ...,Mon Jan 14 10:32:16 +0000 2019,"[[[-97.538285, 32.569477], [-97.033542, 32.569...","Fort Worth, TX"
4,Closed due to road construction in #Evergreen ...,Mon Jan 14 04:43:56 +0000 2019,"[[[-109.060257, 36.992427], [-102.041524, 36.9...","Colorado, USA"


In [32]:
# Check which cities have the most closed roads
final_tweets_df['place.full_name'].value_counts()[:5]

Phoenix, AZ            6
California, USA        4
Houston, TX            4
San Antonio, TX        3
Oak Ridge North, TX    3
Name: place.full_name, dtype: int64

### Data Processing with Regex

USER INPUT: Choose City in Code Block Below.

In [33]:
# Select city
city = 'Houston, TX'

# Create a Series with only the chosen city's relevant tweets
city_tweets = final_tweets_df[final_tweets_df['place.full_name'] == city]['tweet']
city_tweets.reset_index(drop=True, inplace=True)

# Instantiate an empty list to collect relevant intersections on closed roads.
intersection_list = []

In [34]:
# Examine tweet pattern
city_tweets[1]

'Closed due to road construction in #Southside on S Sam Houston Tollway EB between Hwy 288 and Cullen #traffic https://t.co/vuUN2yElDh'

In [35]:
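# Extract the bounding box (i.e. bbox) for the chosen city -- needed to extract HereMaps data
# for the same city
import ast

# Tweets from the same place are assumed to share one bbox, so the first matching row is used
# (this avoids hard-coding a row label). literal_eval parses the string without executing it.
city_rows = final_tweets_df[final_tweets_df['place.full_name'] == city]
bbox_raw = ast.literal_eval(city_rows['place.bounding_box.coordinates'].iloc[0])[0]

# The Twitter bbox is defined by its 4 corners, each stored as [long, lat]; this code takes
# index 3 as the top-left corner and index 1 as the bottom-right corner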
bbox_top_left = str(bbox_raw[3][1]) + ',' + str(bbox_raw[3][0])
bbox_bottom_right = str(bbox_raw[1][1]) + ',' + str(bbox_raw[1][0])
print('bbox_top_left (lat,long):', bbox_top_left)
print('bbox_bottom_right (lat,long):', bbox_bottom_right)

bbox_top_left (lat,long): 30.1546646,-95.823268
bbox_bottom_right (lat,long): 29.522325,-95.069705
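
The corner lookup above depends on the order in which the four corners are listed. A minimal sketch of an alternative that derives the corners from the minimum and maximum coordinates instead is shown below. The function name `bbox_corners` is hypothetical, and the `sample` string is reconstructed from the four values printed above rather than copied from the dataset. It assumes the box does not cross the antimeridian.

```python
import ast

def bbox_corners(bbox_string):
    """Parse a Twitter-style bounding-box string into 'lat,long' corner strings."""
    # literal_eval only accepts Python literals, so it cannot run arbitrary code
    ring = ast.literal_eval(bbox_string)[0]
    longs = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    top_left = f'{max(lats)},{min(longs)}'
    bottom_right = f'{min(lats)},{max(longs)}'
    return top_left, bottom_right

# Reconstructed sample: four [long, lat] corners built from the Houston values above
sample = ('[[[-95.823268, 29.522325], [-95.069705, 29.522325], '
          '[-95.069705, 30.1546646], [-95.823268, 30.1546646]]]')
print(bbox_corners(sample))
# ('30.1546646,-95.823268', '29.522325,-95.069705')
```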


In [36]:
# Find the location phrase in each "useful" tweet

# Regex Pattern 1: capture everything between the first ' on ' and the last '#'
for tweet in city_tweets:
    intersection_list.append(re.findall(r' on (.+)#', tweet))

print('intersection list:', intersection_list)
print()
print('number of tweets available:', city_tweets.shape[0])
# Count only the tweets where the pattern actually matched
print('number of patterns found:', sum(1 for matches in intersection_list if matches))

intersection list: [['288 S Fwy NB at The S Lp, stopped traffic back to Reed Rd. '], ['S Sam Houston Tollway EB between Hwy 288 and Cullen '], ['S Sam Houston Tollway EB between Hwy 288 and Cullen '], ['S Sam Houston Tollway EB between Hwy 288 and Cullen ']]

number of tweets available: 4
number of patterns found: 4
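
Regex Pattern 1 is greedy: `(.+)` runs up to the *last* `#` in the tweet, which is harmless when only one hashtag follows the location but captures too much when there are several. The sketch below compares the greedy pattern with a lazy variant (`.+?`); the second tweet is a made-up example, not taken from the dataset.

```python
import re

# Tweet from the notebook: a single hashtag follows the location phrase
tweet_one_tag = ('Closed due to road construction in #Southside on S Sam Houston Tollway EB '
                 'between Hwy 288 and Cullen #traffic https://t.co/vuUN2yElDh')
# Hypothetical tweet: two hashtags follow the location phrase
tweet_two_tags = 'Closed on Main St at 1st Ave #Downtown #traffic'

greedy = r' on (.+)#'
lazy = r' on (.+?)#'

print(re.findall(greedy, tweet_one_tag))   # ['S Sam Houston Tollway EB between Hwy 288 and Cullen ']
print(re.findall(lazy, tweet_one_tag))     # ['S Sam Houston Tollway EB between Hwy 288 and Cullen ']
print(re.findall(greedy, tweet_two_tags))  # ['Main St at 1st Ave #Downtown ']
print(re.findall(lazy, tweet_two_tags))    # ['Main St at 1st Ave ']
```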


In [38]:
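# Convert list of lists to list of strings (empty string where no pattern matched, so that
# positions stay aligned with city_tweets)
intersection_list_2 = []

for matches in intersection_list:
    intersection_list_2.append(matches[0] if matches else '')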

print('updated intersection list:', intersection_list_2)

updated intersection list: ['288 S Fwy NB at The S Lp, stopped traffic back to Reed Rd. ', 'S Sam Houston Tollway EB between Hwy 288 and Cullen ', 'S Sam Houston Tollway EB between Hwy 288 and Cullen ', 'S Sam Houston Tollway EB between Hwy 288 and Cullen ']


In [40]:
# Instantiate df to hold output intersections
columns = ['city', 'start', 'end', 'tweet']
output_tweets_df = pd.DataFrame(index=range(len(city_tweets)), columns=columns)
output_tweets_df.fillna('', inplace=True)
output_tweets_df

Unnamed: 0,city,start,end,tweet
0,,,,
1,,,,
2,,,,
3,,,,


In [41]:
# Populate city column in outputs df
output_tweets_df.loc[:, 'city'] = city
output_tweets_df

Unnamed: 0,city,start,end,tweet
0,"Houston, TX",,,
1,"Houston, TX",,,
2,"Houston, TX",,,
3,"Houston, TX",,,


In [42]:
# Use regex to further split each location phrase into the intersections that describe the
# start and end of each road closure. \b restricts matches to whole words, so street names
# that merely contain 'at' or 'and' (e.g. 'Gateway', 'Sandy') are not split.

for index, string in enumerate(intersection_list_2):
    print(string)
    output_tweets_df.loc[index, 'tweet'] = city_tweets[index]
    # Regex Pattern 1: '<road> between <cross street> and <cross street>'
    if re.search(r'\bbetween\b', string):
        split_1 = re.split(r'\bbetween\b', string, maxsplit=1) # Split on 'between' first
        split_2 = re.split(r'\band\b', split_1[1], maxsplit=1)
        start = split_1[0] + '&' + split_2[0]
        end = split_1[0] + '&' + split_2[-1] # Same as start if 'and' is missing
        output_tweets_df.loc[index, 'start'] = start
        output_tweets_df.loc[index, 'end'] = end
        print('start:', start)
        print('end:', end)
    # Regex Pattern 2: '<road> at <cross street>, <further detail>'
    elif re.search(r'\bat\b', string):
        split_1 = re.split(r'\bat\b', string, maxsplit=1) # Split on 'at' first
        split_2 = split_1[1].split(',')
        start = split_1[0] + '& ' + split_2[0]
        # Fall back to the start point when there is no text after a comma
        end = split_1[0] + '& ' + (split_2[1] if len(split_2) > 1 else split_2[0])
        output_tweets_df.loc[index, 'start'] = start
        output_tweets_df.loc[index, 'end'] = end
        print('start:', start)
        print('end:', end)
    else:
        print('no matching patterns found for tweet at index:', index)
    print()

288 S Fwy NB at The S Lp, stopped traffic back to Reed Rd. 
start: 288 S Fwy NB &  The S Lp
end: 288 S Fwy NB &  stopped traffic back to Reed Rd. 

S Sam Houston Tollway EB between Hwy 288 and Cullen 
start: S Sam Houston Tollway EB & Hwy 288 
end: S Sam Houston Tollway EB & Cullen 

S Sam Houston Tollway EB between Hwy 288 and Cullen 
start: S Sam Houston Tollway EB & Hwy 288 
end: S Sam Houston Tollway EB & Cullen 

S Sam Houston Tollway EB between Hwy 288 and Cullen 
start: S Sam Houston Tollway EB & Hwy 288 
end: S Sam Houston Tollway EB & Cullen 
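
As more tweet templates are added, the if/elif branches above could be replaced by a list of compiled patterns that is tried in order. The following is only a sketch of that idea: `CLOSURE_PATTERNS` and `parse_closure` are hypothetical names, and, unlike the notebook, the `at` pattern here treats the closure as a single intersection (start and end are the same) and ignores any text after the comma.

```python
import re

# Each pattern captures the main road plus one or two cross streets.
# \s+ around the keywords means only whole words are treated as separators.
CLOSURE_PATTERNS = [
    re.compile(r'^(?P<road>.+?)\s+between\s+(?P<first>.+?)\s+and\s+(?P<second>.+?)\s*$'),
    re.compile(r'^(?P<road>.+?)\s+at\s+(?P<first>[^,]+?)\s*(?:,.*)?$'),
]

def parse_closure(phrase):
    """Return (start, end) intersections for a location phrase, or None if nothing matches."""
    for pattern in CLOSURE_PATTERNS:
        match = pattern.match(phrase.strip())
        if match:
            parts = match.groupdict()
            start = f"{parts['road']} & {parts['first']}"
            # Patterns without a 'second' group describe a single intersection
            end = f"{parts['road']} & {parts.get('second', parts['first'])}"
            return start, end
    return None

print(parse_closure('S Sam Houston Tollway EB between Hwy 288 and Cullen '))
# ('S Sam Houston Tollway EB & Hwy 288', 'S Sam Houston Tollway EB & Cullen')
print(parse_closure('288 S Fwy NB at The S Lp, stopped traffic back to Reed Rd. '))
# ('288 S Fwy NB & The S Lp', '288 S Fwy NB & The S Lp')
print(parse_closure('All lanes blocked'))
# None
```

Adding a new tweet structure then means appending one compiled pattern to the list rather than writing another branch.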



In [43]:
# View output df
output_tweets_df

Unnamed: 0,city,start,end,tweet
0,"Houston, TX",288 S Fwy NB & The S Lp,288 S Fwy NB & stopped traffic back to Reed Rd.,Closed due to road construction on 288 S Fwy N...
1,"Houston, TX",S Sam Houston Tollway EB & Hwy 288,S Sam Houston Tollway EB & Cullen,Closed due to road construction in #Southside ...
2,"Houston, TX",S Sam Houston Tollway EB & Hwy 288,S Sam Houston Tollway EB & Cullen,Closed due to road construction in #Southside ...
3,"Houston, TX",S Sam Houston Tollway EB & Hwy 288,S Sam Houston Tollway EB & Cullen,Closed due to road construction in #Southside ...


In [44]:
# Final cleanup (manual): remove the 'stopped traffic back to' filler from the first end point
output_tweets_df.loc[0, 'end'] = output_tweets_df.loc[0, 'end'].replace('  stopped traffic back to', '')
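
The manual step above could be generalized into a small cleaning function. This is a sketch under assumptions: `clean_intersection` is a hypothetical helper, and of the two filler phrases only `stopped traffic back to` comes from the data shown here; `delays back to` is an invented example of how the list might grow.

```python
import re

# Filler phrases to strip; 'delays back to' is a hypothetical addition
FILLER = re.compile(r'\b(?:stopped traffic back to|delays back to)\b', flags=re.IGNORECASE)

def clean_intersection(text):
    """Drop filler phrases and collapse repeated whitespace in an intersection string."""
    without_filler = FILLER.sub(' ', text)
    return re.sub(r'\s+', ' ', without_filler).strip()

print(clean_intersection('288 S Fwy NB &  stopped traffic back to Reed Rd. '))
# 288 S Fwy NB & Reed Rd.
```

Applied column-wise, for instance with `output_tweets_df['end'].map(clean_intersection)`, this would also trim the stray whitespace left by the splitting step.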

### Final Output

In [45]:
# Remove any duplicates (e.g. re-tweets)
output_tweets_df.drop_duplicates(inplace=True)
output_tweets_df

Unnamed: 0,city,start,end,tweet
0,"Houston, TX",288 S Fwy NB & The S Lp,288 S Fwy NB & Reed Rd.,Closed due to road construction on 288 S Fwy N...
1,"Houston, TX",S Sam Houston Tollway EB & Hwy 288,S Sam Houston Tollway EB & Cullen,Closed due to road construction in #Southside ...
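
`drop_duplicates()` with no arguments compares every column, including the full tweet text, so two reports of the same closure survive if their text differs at all. A minimal sketch of deduplicating on the location columns only is given below; the two rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: the same closure reported twice with slightly different text
closures = pd.DataFrame({
    'city': ['Houston, TX', 'Houston, TX'],
    'start': ['S Sam Houston Tollway EB & Hwy 288'] * 2,
    'end': ['S Sam Houston Tollway EB & Cullen'] * 2,
    'tweet': ['Closed due to road construction (report A)',
              'Closed due to road construction (report B)'],
})

all_columns = closures.drop_duplicates()
by_location = closures.drop_duplicates(subset=['city', 'start', 'end'])
print(len(all_columns), len(by_location))  # 2 1
```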


In [46]:
# Write output df to relevant file
output_tweets_df.to_csv('../../data/2-interim/output_tweets', index=False)

### Future Updates

- Minimize any manual cleaning steps.
- Increase the number of regex patterns to accommodate the diversity in tweet sentence structure.