# Correlating Language to Geographic Location

Having parsed a large number of geographically linked phrases, we would now like to predict the geographic origin of a speaker or writer from their language. Since no dataset can cover the full range of U.S. dialects, we use machine learning techniques to predict geographic origin even when the raw data is not conclusive.

To accomplish this, we first vectorize words and sentences, then train a Naive Bayes classifier (scikit-learn's implementation).
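The vectorize-then-classify flow described above can be sketched end to end with a scikit-learn `Pipeline`. This is a minimal illustration, not the notebook's own code: the words and dialect labels below are made-up stand-ins rather than rows from the DARE corpus, and the step names (`counts`, `tfidf`, `nb`) are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for (word, dialect) pairs; NOT taken from the DARE corpus.
toy_words = ["gooselock", "swale", "frappe", "hosey"]
toy_dialects = ["alabama", "alabama", "massachusetts", "massachusetts"]

# Chain the three stages so raw text goes in and a label comes out.
pipeline = Pipeline([
    ("counts", CountVectorizer()),    # text -> token counts
    ("tfidf", TfidfTransformer()),    # counts -> TF-IDF weights
    ("nb", MultinomialNB()),          # weights -> dialect label
])
pipeline.fit(toy_words, toy_dialects)

# Only "frappe" is a known token here, so it alone drives the prediction.
toy_prediction = pipeline.predict(["Don't barf up your frappe."])
print(toy_prediction[0])
```

The cells below perform the same three stages by hand, which makes the intermediate matrices easier to inspect.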

Please refer to and run `geodare.json`, which converts the raw DARE corpus into a more usable format, prior to using this notebook.

In [1]:
import sklearn
import numpy as np
import pandas as pd
import nltk

In [2]:
'''Read in the cleaned DARE corpus
'''
geodata = pd.read_csv("../data/cleaned_dare_corpus.csv")

In [3]:
'''Create classification categories, i.e. target names
'''
catagories = []
for x in geodata['dialect']:
    if x not in catagories:
        catagories.append(x)

In [4]:
'''Create data point and target lists
'''
examples = []
targets = []

for i,x in enumerate(geodata['word']):
    if pd.notnull(x):
        examples.append(x)
        # DataFrame.get_value was removed in pandas 1.0; .at is the label-based scalar accessor
        dialect = geodata.at[i, 'dialect']
        target = catagories.index(dialect)
        targets.append(target)        
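As an alternative to the two loops above, the category list and integer targets could be built in one pass with `pandas.factorize`. The sketch below uses a hypothetical four-row frame with the same column names as the notebook and assumes the `dialect` column has no missing values (`factorize` encodes missing values as -1 instead of treating them as a category).

```python
import pandas as pd

# Hypothetical miniature frame with the notebook's column names.
toy = pd.DataFrame({
    "word": ["gooselock", None, "frappe", "swale"],
    "dialect": ["alabama", "virginia", "massachusetts", "alabama"],
})

# Codes are assigned in order of first appearance over all rows,
# mirroring a category list built before the null-word filter.
codes, uniques = pd.factorize(toy["dialect"])
toy_categories = list(uniques)

# Keep only rows whose word is present, as the loop with pd.notnull does.
keep = toy["word"].notnull().to_numpy()
toy_examples = toy.loc[keep, "word"].tolist()
toy_targets = codes[keep].tolist()

print(toy_categories)
print(toy_examples)
print(toy_targets)
```

This avoids the repeated `list.index` lookups, which scan the category list once per row.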

In [5]:
'''Sanity check: the number of data points should match the number of targets.

    print("data: ",len(examples))
    print("targets: ",len(targets))
 
    Then, bunch the data points to their targets.
'''

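# sklearn.datasets.base is no longer importable in recent scikit-learn; Bunch lives in sklearn.utils
from sklearn.utils import Bunch
training = Bunch(target=targets, data=examples, target_names=catagories)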
print(training.target_names[training.target[36954]])

south carolina


In [6]:
# Wrap the examples in a DataFrame for inspection; the conversion to
# unicode strings happens in the vectorizer cell below.
df = pd.DataFrame(training.data)
df.head()

| | 0 |
|---|---|
| 0 | gooselock |
| 1 | swale |
| 2 | tickbird |
| 3 | twistification |
| 4 | ahkio |


In [7]:
'''Vectorize text data by the number of occurrences (count)
'''
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df.astype('U').values.ravel())
X_train_counts.shape


(36955, 2951)
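The shape above reads as (number of examples, number of distinct tokens). A small sketch on made-up inputs shows how `CountVectorizer` arrives at such a matrix; the three toy strings are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up inputs: one word, one phrase, one phrase with a repeated word.
toy_docs = ["gooselock", "Scoot your tush over", "a swale, a swale"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index; sort tokens by that index.
# The default token pattern lowercases text and drops one-letter tokens like "a".
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
print(toy_counts.shape)
```

Each row is one input string and each column one vocabulary token, so a corpus of mostly single words yields a very sparse matrix.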

In [8]:
'''Normalize vectorized data with Term Frequency times Inverse Document Frequency (tfidf)
'''
from sklearn.feature_extraction.text import TfidfTransformer
# Fit the TF-IDF weights on the count matrix and transform it in one step
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(36955, 2951)
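To make the normalization concrete, the sketch below recomputes the weights by hand on a made-up count matrix and compares them with `TfidfTransformer`. It assumes the transformer's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization); the numbers are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 documents x 2 terms (made-up numbers).
counts = np.array([[1, 0],
                   [1, 1],
                   [2, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
weighted = counts * idf                           # term frequency times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, from_sklearn))
```

A term that appears in every document (the first column here) gets the minimum IDF of 1, while rarer terms are weighted up.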

In [9]:
'''Train a Naive Bayes classifier on the existing data
'''
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train_tfidf, training.target)

In [10]:
'''Example of using the NB classifier to predict the geographic origin of unseen data points.
'''
docs_new = ['What a gooselock!', 'Scoot your tush over.', "That's a hosey. Don't barf up your frappe."]
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = classifier.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, training.target_names[category]))

'What a gooselock!' => alabama
'Scoot your tush over.' => virginia
"That's a hosey. Don't barf up your frappe." => massachusetts
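A single predicted label hides how confident the classifier is. One way to surface that is `predict_proba`, which returns a probability per class. The sketch below is self-contained and uses a made-up five-word training set instead of the notebook's `classifier`; the words and labels are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training pairs; NOT taken from the DARE corpus.
toy_words = ["gooselock", "swale", "frappe", "hosey", "tush"]
toy_labels = ["alabama", "alabama", "massachusetts", "massachusetts", "virginia"]

vect = CountVectorizer()
clf = MultinomialNB().fit(vect.fit_transform(toy_words), toy_labels)

# One probability per class, in the order given by clf.classes_.
probs = clf.predict_proba(vect.transform(["That's a hosey."]))[0]

# Rank the classes from most to least probable.
ranked = sorted(zip(clf.classes_, probs), key=lambda pair: pair[1], reverse=True)
for label, p in ranked:
    print(f"{label}: {p:.2f}")
```

With the notebook's objects, the analogous call would be `classifier.predict_proba(X_new_tfidf)`, with column indices mapped back through `training.target_names`.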
