# spaCy 

spaCy is a Python library for Natural Language Processing (NLP). Its pipeline includes Named Entity Recognition (NER), word vectors, and more. The NER component relies on context, so it tends to recognize locations more reliably when they follow cue words such as "at" or "in". By modifying the tweet strings to include those cues, we make spaCy more effective at gathering location information.

We specifically use spaCy to tag countries, cities, and states (GPE); non-GPE locations such as mountain ranges and bodies of water (LOC); and buildings, airports, highways, and bridges (FAC).
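As a rough illustration of what filtering on these three labels looks like, the sketch below hand-labels a made-up sentence with `Span` objects instead of running a trained model. The sentence, spans, and labels are assumptions chosen for the example, not real spaCy predictions, and `PLACE_LABELS` and `place_entities` are hypothetical names.

```python
import spacy
from spacy.tokens import Span

# Labels this notebook treats as places (hypothetical constant name).
PLACE_LABELS = {"GPE", "LOC", "FAC"}

def place_entities(doc):
    """Return (text, label) pairs for entities whose label is a place label."""
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in PLACE_LABELS]

# Blank English pipeline: tokenizer only, no trained NER model is loaded.
nlp = spacy.blank("en")
doc = nlp("Crash near Sacramento on the Golden Gate Bridge today")

# Hand-assigned entities for illustration only (token offsets: start, end-exclusive).
doc.ents = [
    Span(doc, 2, 3, label="GPE"),   # "Sacramento"
    Span(doc, 5, 8, label="FAC"),   # "Golden Gate Bridge"
    Span(doc, 8, 9, label="DATE"),  # "today" is not a place, so it is filtered out
]

places = place_entities(doc)
print(places)
```

With a trained pipeline such as `en_core_web_sm`, `doc.ents` would be filled in by the model and the same filter would apply unchanged.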

This notebook experiments with spaCy on a subset of our tweet data to try to extract locations (cross streets or coordinates) from tweets. spaCy does not require an API key and has extensive, actively maintained documentation for self-study. More information on spaCy can be found at https://spacy.io/api.


In [2]:
import pandas as pd
import numpy as np
import re
import spacy
import string
import datetime

from spacy.matcher import Matcher
from spacy.tokens import Span

from spacy import displacy



In [4]:
# Reading in our subset data as df. 

df = pd.read_csv("../data/tweets/historical_tweets.csv")

In [5]:
# Display the data to confirm it was read in correctly.
df

Unnamed: 0,id,username,date,text,hashtags,geo,type
0,519967790208266240,CaltransDist3,2014-10-08 21:49:16+00:00,Update: ETO is 7 p.m. to reopen EB I-80 betwee...,,,official
1,519952600498593792,CaltransDist3,2014-10-08 20:48:54+00:00,Eastbound Interstate 80 closed between Auburn ...,,,official
2,519937459191578625,CaltransDist3,2014-10-08 19:48:44+00:00,Hwy 113 closed at George Washington in Sutter ...,,,official
3,519877652728274945,CaltransDist3,2014-10-08 15:51:05+00:00,@D3PIO update hwy 162 OT big rig is near Butte...,,,official
4,519876511751749632,CaltransDist3,2014-10-08 15:46:33+00:00,Eastbound 80 at Penryn Rd #2&3 lanes blocked d...,#2,,official
...,...,...,...,...,...,...,...
4125,746387327195418624,CaltransDist6,2016-06-24 16:59:40+00:00,"KERN, Update: Ongoing full closure of 178 near...",,,official
4126,746378276206088193,CaltransDist6,2016-06-24 16:23:42+00:00,Madera: Northbound State Route 99 traffic is b...,#WorkZoneAlert,,official
4127,746361446251040769,CaltransDist6,2016-06-24 15:16:50+00:00,MADERA: Northbound #1 (left) lane on SR-99 clo...,#1,,official
4128,746183382955098112,CaltransDist6,2016-06-24 03:29:16+00:00,KERN: Full closure of 178 near Bodfish/Lake Is...,,,official


In [6]:
# Display the text column.
df['text']

0       Update: ETO is 7 p.m. to reopen EB I-80 betwee...
1       Eastbound Interstate 80 closed between Auburn ...
2       Hwy 113 closed at George Washington in Sutter ...
3       @D3PIO update hwy 162 OT big rig is near Butte...
4       Eastbound 80 at Penryn Rd #2&3 lanes blocked d...
                              ...                        
4125    KERN, Update: Ongoing full closure of 178 near...
4126    Madera: Northbound State Route 99 traffic is b...
4127    MADERA: Northbound #1 (left) lane on SR-99 clo...
4128    KERN: Full closure of 178 near Bodfish/Lake Is...
4129    #CaltransJobs alert - the exam for Engineering...
Name: text, Length: 4130, dtype: object

In [7]:
# Check the number of tweets in our subset to make sure enough data is present.
# Print DF shape
print(df.shape)

# Show head of tweets
df.head()

(4130, 7)


Unnamed: 0,id,username,date,text,hashtags,geo,type
0,519967790208266240,CaltransDist3,2014-10-08 21:49:16+00:00,Update: ETO is 7 p.m. to reopen EB I-80 betwee...,,,official
1,519952600498593792,CaltransDist3,2014-10-08 20:48:54+00:00,Eastbound Interstate 80 closed between Auburn ...,,,official
2,519937459191578625,CaltransDist3,2014-10-08 19:48:44+00:00,Hwy 113 closed at George Washington in Sutter ...,,,official
3,519877652728274945,CaltransDist3,2014-10-08 15:51:05+00:00,@D3PIO update hwy 162 OT big rig is near Butte...,,,official
4,519876511751749632,CaltransDist3,2014-10-08 15:46:33+00:00,Eastbound 80 at Penryn Rd #2&3 lanes blocked d...,#2,,official


In [8]:
# Create new columns for the spaCy-ready tweet text and the extracted locations.
df['modified_text'] = ''
df['location'] = ''

# Show modified DF
df.head(2)

Unnamed: 0,id,username,date,text,hashtags,geo,type,modified_text,location
0,519967790208266240,CaltransDist3,2014-10-08 21:49:16+00:00,Update: ETO is 7 p.m. to reopen EB I-80 betwee...,,,official,,
1,519952600498593792,CaltransDist3,2014-10-08 20:48:54+00:00,Eastbound Interstate 80 closed between Auburn ...,,,official,,


In [9]:
# Dictionary from a past DC cohort project: expands road abbreviations to help spaCy recognize locations.
# With regex=True (see spacy_cleaner), the keys are matched as case-sensitive regular expressions.
format_dict = {"hwy": "highway ",
               "blvd": "boulevard",
               r" st\b": " street",   # \b keeps " st" from matching inside words such as " stalled"
               "CR ": "County Road ",
               "SR ": "State Road ",
               "I-": "Interstate ",
               "EB ": "Eastbound ",
               "WB ": "Westbound ",
               "SB ": "Southbound ",
               "NB ": "Northbound ",
               " on ": " at ",
               " E ": " East ",
               " W ": " West ",
               " S ": " South ",
               " N ": " North ",
               r"\bmi ": "mile ",     # \b keeps "mi " from matching the end of words such as "semi "
               "between ": "at ",
               "Between ": "at ",
               " In ": " in ",
               " in ": " at "}

In [10]:
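def spacy_cleaner(df, col, word_dict):
    """Expand abbreviations in df[col] and return the text prefixed with "At " in title case."""
    # Because regex=True, the keys of word_dict are treated as regex patterns.
    modified_text = "At " + df[col].replace(word_dict, regex=True)
    modified_text = modified_text.str.title()
    return modified_text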

In [11]:
df['modified_text'] = spacy_cleaner(df, 'text', format_dict)

In [12]:
df['modified_text']

0       At Update: Eto Is 7 P.M. To Reopen Eastbound I...
1       At Eastbound Interstate 80 Closed At Auburn An...
2       At Hwy 113 Closed At George Washington At Sutt...
3       At @D3Pio Update Highway  162 Ot Big Rig Is Ne...
4       At Eastbound 80 At Penryn Rd #2&3 Lanes Blocke...
                              ...                        
4125    At Kern, Update: Ongoing Full Closure Of 178 N...
4126    At Madera: Northbound State Route 99 Traffic I...
4127    At Madera: Northbound #1 (Left) Lane At Sr-99 ...
4128    At Kern: Full Closure Of 178 Near Bodfish/Lake...
4129    At #Caltransjobs Alert - The Exam For Engineer...
Name: modified_text, Length: 4130, dtype: object
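In the output above, row 2 still reads "Hwy 113" while row 3 shows "Highway  162" with a doubled space. The dictionary keys are case-sensitive patterns, so `hwy` does not match `Hwy`, and the replacement adds a space that the key did not consume. A minimal sketch of one possible alternative, using only the standard library `re` module with word boundaries and case-insensitive matching, is shown below. The shortened `ABBREVIATIONS` table and the `expand_abbreviations` helper are hypothetical and are not part of the original notebook.

```python
import re

# Hypothetical, shortened abbreviation table (lower-case keys, no padding spaces).
ABBREVIATIONS = {
    "hwy": "highway",
    "blvd": "boulevard",
    "st": "street",
    "eb": "eastbound",
    "wb": "westbound",
}

# One alternation wrapped in \b...\b so that "st" cannot match inside "state".
_ABBREV_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in ABBREVIATIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_abbreviations(text):
    """Replace whole-word abbreviations, ignoring case."""
    return _ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)

example = expand_abbreviations("EB Hwy 113 closed at Main St near the state line")
print(example)
```

Under these assumptions the helper could be applied to a column with `df['text'].apply(expand_abbreviations)` before the "At " prefix and title-casing steps.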

In [25]:
def spacy_loc(df, text, geo_column):
    """Apply spaCy NER to df[text] and store each row's place entities in df[geo_column]."""

    # Instantiate the spaCy model once, outside the loop.
    # en_core_web_sm is the small English pipeline.
    nlp = spacy.load("en_core_web_sm")

    # Entity labels treated as places
    place_labels = {'GPE', 'FAC', 'LOC'}

    # Entity loop adapted from the Location-Extraction notebook in the
    # DC GA-Twitter-Road-Closures-Client-Project
    locations = []
    for doc in nlp.pipe(df[text].astype(str)):  # one document per row of the text column
        # collect the unique entities labelled as places
        locations.append({ent.text for ent in doc.ents if ent.label_ in place_labels})

    # Assign the whole column at once. Setting df[geo_column].iloc[i] row by row
    # is chained indexing and triggers pandas' SettingWithCopyWarning.
    df[geo_column] = locations
    return df[geo_column]

In [26]:
df.head()

Unnamed: 0,id,username,date,text,hashtags,geo,type,modified_text,location
0,519967790208266240,CaltransDist3,2014-10-08 21:49:16+00:00,Update: ETO is 7 p.m. to reopen EB I-80 betwee...,,,official,At Update: Eto Is 7 P.M. To Reopen Eastbound I...,
1,519952600498593792,CaltransDist3,2014-10-08 20:48:54+00:00,Eastbound Interstate 80 closed between Auburn ...,,,official,At Eastbound Interstate 80 Closed At Auburn An...,
2,519937459191578625,CaltransDist3,2014-10-08 19:48:44+00:00,Hwy 113 closed at George Washington in Sutter ...,,,official,At Hwy 113 Closed At George Washington At Sutt...,
3,519877652728274945,CaltransDist3,2014-10-08 15:51:05+00:00,@D3PIO update hwy 162 OT big rig is near Butte...,,,official,At @D3Pio Update Highway 162 Ot Big Rig Is Ne...,
4,519876511751749632,CaltransDist3,2014-10-08 15:46:33+00:00,Eastbound 80 at Penryn Rd #2&3 lanes blocked d...,#2,,official,At Eastbound 80 At Penryn Rd #2&3 Lanes Blocke...,


In [27]:
df['location']

0        
1        
2        
3        
4        
       ..
4125     
4126     
4127     
4128     
4129     
Name: location, Length: 4130, dtype: object

In [None]:
# Run spaCy on the cleaned text and store the results in the location column.
spacy_loc(df, 'modified_text', 'location')
df['location'].head()


## Summary

We attempted to use spaCy to extract street names from tweets so that we could plot specific latitude and longitude points with the Google or Here.com APIs.
We found that spaCy would require a significant amount of model training to recognize street names, let alone extract them and pass them effectively to Google or Here.com. With thousands of streets that we would have to enter manually from our tweets, we chose to look for alternatives.
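The imports at the top of this notebook include `Matcher`, which the cells above never use. Rule-based matching is one way highway mentions might be tagged without retraining the model. The sketch below illustrates that idea under stated assumptions; it is not something this notebook evaluated. The patterns and the `highway_mentions` helper are hypothetical, and the code assumes spaCy v3's `Matcher.add(key, patterns)` signature.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: the Matcher only needs the tokenizer and vocab.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical token patterns for mentions such as "Interstate 80" or "State Route 99".
highway_patterns = [
    [{"LOWER": {"IN": ["interstate", "highway", "hwy"]}}, {"IS_DIGIT": True}],
    [{"LOWER": "state"}, {"LOWER": "route"}, {"IS_DIGIT": True}],
]
matcher.add("HIGHWAY", highway_patterns)  # spaCy v3 signature: add(key, patterns)

def highway_mentions(text):
    """Return the text of every span that matches a highway pattern."""
    doc = nlp(text)
    return [doc[start:end].text for _match_id, start, end in matcher(doc)]

mentions = highway_mentions("Eastbound Interstate 80 closed between Auburn and Colfax")
print(mentions)
```

Patterns like these only cover numbered routes. Named local streets would still need either a street gazetteer or model training, which is the cost described in the summary.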
