# Correlating Language to Geographic Location

After parsing a large number of geographically linked phrases, we would like to predict the geographic origin of a speaker or writer from their language. Since no dataset can cover the full extent of U.S. dialects, we use machine learning techniques to predict geographic origin even when the raw data is not entirely conclusive.

To do this, we first vectorize words and sentences, then apply a Naive Bayes classifier (scikit-learn's implementation).
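As a rough illustration of that plan, here is a minimal sketch that chains a count vectorizer, a term-frequency transform, and scikit-learn's `MultinomialNB`. The toy words, the labels `region_a` / `region_b`, and the `toy_*` names are hypothetical stand-ins, not the DARE data, and the choice of the multinomial variant is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: a few words with made-up region labels.
toy_words = ["mukluk", "muskeg", "ulu", "hominy", "chinquapin", "cushaw"]
toy_targets = [0, 0, 0, 1, 1, 1]
toy_names = ["region_a", "region_b"]

# Counts -> term frequencies -> multinomial Naive Bayes.
toy_model = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(use_idf=False),
    MultinomialNB(),
)
toy_model.fit(toy_words, toy_targets)

# Predict the region index for two words seen during training.
predicted = toy_model.predict(["muskeg", "cushaw"])
print([toy_names[i] for i in predicted])
```

A word that never appeared in training yields an all-zero feature row, so the prediction for it depends only on the class priors.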

Please refer to and run `geodare.json`, which converts the raw DARE corpus into a more usable format, prior to using this notebook.

In [11]:
import sklearn
import numpy as np
import pandas as pd
import nltk
import sys

In [2]:
'''Read in the cleaned DARE corpus
'''
geodata = pd.read_csv("../data/cleaned_dare_corpus.csv")

In [3]:
'''Create classification categories, i.e. target names
'''
catagories = []
for x in geodata['dialect']:
    if x not in catagories:
        catagories.append(x)

In [4]:
'''Create data point and target lists
'''
examples = []
targets = []

# Walk words and dialects in parallel; skip rows with no word
for word, dialect in zip(geodata['word'], geodata['dialect']):
    if pd.notnull(word):
        examples.append(word)
        # Encode the dialect as its position in the category list
        targets.append(catagories.index(dialect))

['gooselock', 'swale', 'tickbird', 'twistification', 'ahkio', 'babiche', 'banya', 'barabara', 'bidar', 'bingle', 'cheechako', 'chy', 'hiyu', 'hoochinoo', 'hooligan', 'icta', 'kamleika', 'kelpfish', 'klootchman', 'lagoonberry', 'lingonberry', 'makoola', 'mashu', 'moroshka', 'mossberry', 'mukluk', 'muktuk', 'musher', 'muskeg', 'nagoonberry', 'needlefish', 'niggerhead', 'nushnik', 'oogruk', 'panguingue', 'petrushki', 'pickerel', 'pirok', 'pogy', 'poque', 'potlatch', 'puchki', 'pulka', 'pupmobile', 'redberry', 'redfish', 'salmonberry', 'sarana', 'sheefish', 'siwash', 'skijoring', 'skookum', 'snowmachine', 'soapberry', 'taku', 'ulu', 'wanigan', 'washateria', 'williwaw', 'wineberry', 'ackempucky', 'babiche', 'bogan', 'chinquapin', 'chogset', 'coakum', 'cushaw', 'eatup', 'hominy', 'killhag', 'kinnikinnick', 'kiskitomas', 'kyack', 'maninose', 'mechameck', 'methy', 'moosemise', 'mummichog', 'muskeg', 'muskellunge', 'musquash', 'namaycush', 'nitchie', 'pauhagen', 'pegmonk', 'pekan', 'persimmon',
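The two cells above build the category list and integer targets by hand. The same encoding could be sketched with `pandas.factorize`, which assigns codes in order of first appearance. The toy frame and the `toy_*` names below are hypothetical; only the column names `word` and `dialect` come from the notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned DARE corpus; the rows are hypothetical.
toy = pd.DataFrame({
    "word": ["mukluk", "muskeg", None, "hominy"],
    "dialect": ["alaska", "alaska", "maine", "south carolina"],
})

# factorize assigns integer codes in order of first appearance,
# mirroring the list-building loop plus the list.index() lookup.
codes, uniques = pd.factorize(toy["dialect"])
toy_categories = list(uniques)

# Keep only rows that actually have a word, as the notebook does.
mask = toy["word"].notnull()
toy_examples = toy.loc[mask, "word"].tolist()
toy_targets = codes[mask.to_numpy()].tolist()

print(toy_categories)
print(toy_examples)
print(toy_targets)
```

Factorizing the full column before masking keeps the category indices identical to those produced by looping over every row, including rows whose word is missing.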

In [5]:
'''Sanity check: the number of data points should match the number of targets.
   Then, bunch the data points to their targets.
'''
from sklearn.utils import Bunch

print("data: ", len(examples))
print("targets: ", len(targets))

training = Bunch(target=targets, data=examples, target_names=catagories)
# Show the dialect name of the last example
print(training.target_names[training.target[-1]])

data:  36955
targets:  36955
south carolina


In [21]:
df = pd.DataFrame(training.data)
df.head()

                0
0       gooselock
1           swale
2        tickbird
3  twistification
4           ahkio


In [25]:
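from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()  # bag-of-words counts: one row per example, one column per term
X_train_counts = count_vect.fit_transform(df.astype('U').values.ravel())
X_train_counts.shape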

(39792, 2952)
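To make the shape above easier to read, here is a small sketch of what `CountVectorizer` produces with its default settings. The four input strings and the `toy_*` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical inputs: three single words and one two-word entry.
toy_words = ["mukluk", "muskeg", "salmon berry", "muskeg"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_words)

# Rows are input strings; columns are vocabulary terms in sorted order.
print(toy_counts.shape)
print(sorted(toy_vect.vocabulary_))
print(toy_counts.toarray())
```

The first dimension is the number of input strings and the second is the number of distinct tokens, so repeated words add rows but not columns.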

In [26]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(39792, 2952)
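With `use_idf=False` and the default `norm='l2'`, `TfidfTransformer` only rescales each row of counts to unit length, which is why the shape is unchanged. A minimal sketch on a hypothetical 2x3 count matrix (the `toy_*` names are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: two documents over a three-term vocabulary.
toy_counts = csr_matrix(np.array([[2, 1, 0],
                                  [0, 0, 3]]))

# No IDF weighting: each row is divided by its Euclidean (L2) norm.
toy_tf = TfidfTransformer(use_idf=False).fit_transform(toy_counts).toarray()
row_norms = np.linalg.norm(toy_tf, axis=1)

print(toy_tf)
print(row_norms)
```

Every row ends up with norm 1, so entries with many tokens do not outweigh short ones in later steps.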