# Correlating Language to Geographic Location

Having parsed a large number of geographically linked phrases, we would now like to predict the geographic origin of a speaker or writer from their language. Since no dataset can cover the full range of U.S. dialects, we will use machine learning techniques to predict geographic origin even when the raw data is not conclusive.

To accomplish this, we first vectorize words and sentences, then apply a Naive Bayes classifier (scikit-learn's implementation).
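
As a rough illustration of that two-step approach, here is a minimal sketch on made-up data. The phrases and region labels are hypothetical (they are not taken from DARE), and `MultinomialNB` is an assumption: the text only says "a Naive Bayes classifier", and the multinomial variant is simply a common choice for word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: phrases paired with a region label (not from DARE)
phrases = ["wicked good chowder", "wicked cold", "fixing to leave", "fixing to eat supper"]
regions = ["new england", "new england", "south", "south"]

# Step 1: turn each phrase into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(phrases)

# Step 2: fit a multinomial Naive Bayes classifier on the counts (assumed variant)
clf = MultinomialNB()
clf.fit(X, regions)

# Classify an unseen phrase, reusing the vocabulary learned in step 1
prediction = clf.predict(vectorizer.transform(["wicked good supper"]))
print(prediction[0])
```

Note that the unseen phrase is passed through `transform`, not `fit_transform`, so it is encoded with the vocabulary the classifier was trained on.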

Before using this notebook, please run `geodare.json`, which converts the raw DARE corpus into a more usable format.

In [20]:
import sklearn
import numpy as np
import pandas as pd
import nltk

In [24]:
geodata = pd.read_csv("../data/cleaned_dare_corpus.csv")

In [25]:
# Collect the distinct dialect labels in order of first appearance
catagories = []
for x in geodata['dialect']:
    if x not in catagories:
        catagories.append(x)

In [30]:
examples = []
targets = []

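# Keep every non-null word and record the index of its dialect label
for i, x in enumerate(geodata['word']):
    if pd.notnull(x):
        examples.append(x)
        # Look up this row's dialect by label (assumes the default integer index)
        dialect = geodata.at[i, 'dialect']
        target = catagories.index(dialect)
        targets.append(target)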
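
The same label-to-integer encoding can also be sketched with `pd.factorize`, which returns integer codes together with the unique labels in order of first appearance. The small DataFrame below is a hypothetical stand-in for the cleaned corpus, and the sketch is an alternative under that assumption, not a drop-in replacement: it builds the label list only from rows that have a word, so the integer assigned to each dialect may differ from the loop-based version.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned corpus (values are made up)
toy = pd.DataFrame({
    "word": ["pop", None, "soda", "coke"],
    "dialect": ["midwest", "midwest", "northeast", "south"],
})

# Drop rows with no word, mirroring the pd.notnull check
kept = toy[toy["word"].notnull()]

# Integer codes plus the unique labels, in order of first appearance
codes, labels = pd.factorize(kept["dialect"])

examples = kept["word"].tolist()
targets = codes.tolist()
target_names = labels.tolist()

print(examples, targets, target_names)
```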

In [32]:
from sklearn.utils import Bunch

# Sanity check: the number of data points should match the number of data targets
print("data: ",len(examples))
print("targets: ",len(targets))

# Bunch the data points to their targets
training = Bunch(target=targets, data=examples, target_names=catagories)
# Print the dialect name of the last example
print(training.target_names[training.target[36954]])

data:  36955
targets:  36955
south carolina


In [36]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training.data)
X_train_counts.shape


(36955, 2951)
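
The shape reads as (number of examples, vocabulary size): each row is one example and each column counts one distinct token. A minimal sketch on three made-up phrases (hypothetical data, not from the corpus) shows the layout:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy phrases
docs = ["red bug", "june bug", "red red wagon"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)

# One row per phrase, one column per distinct token
print(counts.shape)

# vocabulary_ maps each token to its column index; sort tokens by that index
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)

# Dense view of the sparse matrix: entry [i, j] is how often token j occurs in phrase i
dense = counts.toarray()
print(dense)
```

Here the matrix has 3 rows and 4 columns, and the repeated "red" in the last phrase shows up as a count of 2.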