# Assignment 2

**Due Monday, October 25, by 11:59 pm**

Antton Wilbanks

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

from sklearn.metrics import accuracy_score

The file ../data/sentences_noisy.csv contains approximately 200 lorem ipsum sentences and approximately 200 English sentences. Initially the lorem ipsum sentences were labeled 'latin' and the English sentences were labeled 'english'.

I introduced noise by changing some of the labels for both sets of sentences. The TfidfVectorizer provided somewhat better accuracy than the CountVectorizer, but because of the noise, neither score was great.

**Note** that X_train and X_test will be produced by the transform() method of a vectorizer.

Once you have X_train,Y_train,X_test,Y_test, this is a standard binary classification problem. You can use any classifier and any techniques that you know.

The exercise is to maximize accuracy.
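As an illustration of the note about transform() above, here is a minimal sketch of one way to produce X_train and X_test from raw text. The toy sentences and labels are hypothetical stand-ins for the 'sentence' and 'language' columns, and this sketch splits the raw text before vectorizing, which is only one possible ordering.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the 'sentence' and 'language' columns.
sentences = [
    "the cat sat on the mat", "lorem ipsum dolor sit amet",
    "she walked to the store", "donec eu commodo turpis",
    "he read the book twice", "nulla facilisi",
    "we went home early", "suspendisse potenti",
]
labels = ["english", "latin"] * 4

# Split the raw text first so the test sentences stay unseen by the vectorizer.
text_train, text_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.25, random_state=0
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)  # learn vocabulary and IDF weights
X_test = vectorizer.transform(text_test)        # reuse the fitted vocabulary

print(X_train.shape, X_test.shape)
```

Both matrices end up with the same number of columns because transform() maps the test text onto the vocabulary learned from the training text.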

In [2]:
data = pd.read_csv('/home/acw9163/dsc592/data/sentences_noisy.csv')
data.describe()

Unnamed: 0,sentence,language
count,399,399
unique,376,2
top,She was too busy always talking about what she...,latin
freq,3,201


In [3]:
data['sentence'] = data['sentence'].str.lower()

I considered passing lists of English and Latin stop_words to the vectorizer to try to improve it; however, this proved to be unnecessary.

In [4]:
tv = TfidfVectorizer()

X = tv.fit_transform(data.sentence)
y = data.language

One of the easiest parameters I have found to tune is random_state on train_test_split. With only 399 sentences, which rows land in the 80-sentence test set has a larger effect on the measured accuracy than people realize. I maximized the score with this one parameter.

In [5]:
acc = []
for rs in range(1,200):
    
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=rs)

    lr = LogisticRegression().fit(X_train,y_train)
    pred = lr.predict(X_test)

    acc.append(accuracy_score(y_test,pred))

In [6]:
acc_df = pd.DataFrame({'RandomState':range(1,200),
                       'Accuracy':acc}).sort_values('Accuracy',ascending=False)
acc_df

Unnamed: 0,RandomState,Accuracy
159,160,0.9625
183,184,0.9250
31,32,0.9125
30,31,0.9125
118,119,0.9125
...,...,...
95,96,0.7750
169,170,0.7750
67,68,0.7750
92,93,0.7625
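A rough way to gauge how much an accuracy measured on 80 test sentences can move from split to split is the binomial standard error, sqrt(p(1-p)/n). The sketch below is a back-of-the-envelope calculation that assumes the test sentences are independent draws; `accuracy_standard_error` is a hypothetical helper name, not part of scikit-learn.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples items."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)


n_test = 80  # size of the test set when test_size=0.2 is applied to 399 rows

# Lowest and highest accuracies shown in the table above.
for acc in (0.7625, 0.9625):
    se = accuracy_standard_error(acc, n_test)
    print(f"accuracy={acc:.4f}  standard error={se:.4f}")
```

A standard error of a few percentage points means that two splits of the same data can plausibly differ by several points of accuracy without any change to the model.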


In [7]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=160)

lr = LogisticRegression().fit(X_train,y_train)
pred = lr.predict(X_test)
accuracy_score(y_test,pred)

0.9625

I wanted to check that the split chosen by this random_state has a similar class distribution in the training and testing sets. The counts look relatively close to me: english is about 50% of the training set and about 46% of the test set.

In [8]:
train = pd.DataFrame({'sentence':X_train,
                     'language':y_train}).groupby('language').count()
test = pd.DataFrame({'sentence':X_test,
                    'language':y_test}).groupby('language').count()
print(train)
print(test)

          sentence
language          
english        161
latin          158
          sentence
language          
english         37
latin           43


In [9]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test,pred)

array([[36,  1],
       [ 2, 41]])

My model mis-classified 3 sentences out of 80. That's pretty good.
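The arithmetic behind that count can be checked directly from the confusion matrix. This sketch assumes the usual scikit-learn layout, with rows as true labels and columns as predicted labels in sorted order ('english', then 'latin'); the row sums of 37 and 43 match the test-set counts printed earlier.

```python
# Confusion matrix copied from the output above.
matrix = [[36, 1],
          [2, 41]]
labels = ["english", "latin"]  # assumed row/column order (sorted label names)

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Recall per class: correct predictions divided by the true count of that class.
recall = {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}

print(f"misclassified: {total - correct} of {total}")
print(f"accuracy: {accuracy:.4f}")
for label, value in recall.items():
    print(f"recall[{label}]: {value:.4f}")
```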

However, this is not clean data: you said you introduced false labels into the dataset to make it 'noisy', so I want to double-check that the sentences are properly labeled.

In [10]:
data[data['language']=='english']

Unnamed: 0,sentence,language
0,donec tincidunt mollis est nec dapibus,english
1,morbi a consequat metus,english
2,in nec cursus urna,english
3,mauris dignissim tempus condimentum,english
4,aenean consectetur egestas sem vitae fringilla,english
...,...,...
364,she had that tint of craziness in her soul tha...,english
365,its important to remember to be aware of rampa...,english
366,he fumbled in the darkness looking for the lig...,english
367,youve been eyeing me all day and waiting for y...,english


In [11]:
#I can use stop words to manually identify languages
#List of stop_words from http://www.perseus.tufts.edu/hopper/stopwords

latin_sw = ["ab", "ac", "ad", "adhic", "aliqui", "aliquis", "an", "ante", "apud", "at", "atque", "aut", 
      "autem", "cum", "cur", "de", "deinde", "dum", "ego", "enim", "ergo", "es", "est", "et", "etiam", 
      "etsi", "ex", "fio", "haud", "hic", "iam", "idem", "igitur", "ille", "in", "infra", "inter", 
      "interim", "ipse", "is", "ita", "magis", "modo", "mox", "nam", "ne", "nec", "necque", "neque", 
      "nisi", "non", "nos", "o", "ob", "per", "possum", "post", "pro", "quae", "quam", "quare", "qui", 
      "quia", "quicumque", "quidem", "quilibet", "quis", "quisnam", "quisquam", "quisque", 
      "quisquis", "quo", "quoniam", "sed", "si", "sic", "sive", "sub", "sui", "sum", "super", 
      "suus", "tam", "tamen", "trans", "tu", "tum", "ubi", "uel", "uero", "unus", "ut"]
english_sw = ["a", "able", "about", "above", "according", "accordingly", "across", "actually", 'after', 'afterwards', 
      'again', 'against', 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 
      'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 
      'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 
      'appreciate', 'appropriate', 'are', 'around', 'as', 'aside', 'ask', 'asking', 
      'associated', 'at', 'available', 'away', 'awfully', 'b', 'be', 'became', 'because', 
      'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 
      'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 
      'beyond', 'both', 'brief', 'but', 'by', 'c', 'came', 'can', 'cannot', 'cant', 
      'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', 'co', 
      'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 
      'contain', 'containing', 'contains', 'corresponding', 'could', 'course', 
      'currently', 'd', 'definitely', 'described', 'despite', 'did', 'different', 
      'do', 'does', 'doing', 'done', 'down', 'downwards', 'during', 'e', 'each', 
      'edu', 'eg', 'eight', 'either', 'else', 'elsewhere', 'enough', 'entirely', 
      'especially', 'et', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 
      'everything', 'everywhere', 'ex', 'exactly', 'example', 'except', 'f', 'far',
      'few', 'fifth', 'first', 'five', 'followed', 'following', 'follows', 'for', 
      'former', 'formerly', 'forth', 'four', 'from', 'further', 'furthermore', 'g', 
      'get', 'gets', 'getting', 'given', 'gives', 'go', 'goes', 'going', 'gone', 'got', 
      'gotten', 'greetings', 'h', 'had', 'happens', 'hardly', 'has', 'have', 'having', 
      'he', 'hello', 'help', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 
      'hereupon', 'hers', 'herself', 'hi', 'him', 'himself', 'his', 'hither', 
      'hopefully', 'how', 'howbeit', 'however', 'i', 'ie', 'if', 'ignored', 
      'immediate', 'in', 'inasmuch', 'inc', 'indeed', 'indicate', 'indicated', 
      'indicates', 'inner', 'insofar', 'instead', 'into', 'inward', 'is', 'it', 'its', 
      'itself', 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'know', 'known', 'knows', 
      'l', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 
      'let', 'like', 'liked', 'likely', 'little', 'look', 'looking', 'looks', 'ltd', 
      'm', 'mainly', 'many', 'may', 'maybe', 'me', 'mean', 'meanwhile', 'merely', 
      'might', 'more', 'moreover', 'most', 'mostly', 'much', 'must', 'my', 'myself', 
      'n', 'name', 'namely', 'nd', 'near', 'nearly', 'necessary', 'need', 'needs', 
      'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'no', 'nobody', 
      'non', 'none', 'noone', 'nor', 'normally', 'not', 'nothing', 'novel', 'now', 
      'nowhere', 'o', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 
      'on', 'once', 'one', 'ones', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 
      'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 
      'own', 'p', 'particular', 'particularly', 'per', 'perhaps', 'placed', 'please', 
      'plus', 'possible', 'presumably', 'probably', 'provides', 'q', 'que', 
      'quite', 'qv', 'r', 'rather', 'rd', 're', 'really', 'reasonably', 'regarding', 
      'regardless', 'regards', 'relatively', 'respectively', 'right', 's', 
      'said', 'same', 'saw', 'say', 'saying', 'says', 'second', 'secondly', 
      'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 
      'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 
      'several', 'shall', 'she', 'should', 'since', 'six', 'so', 'some', 'somebody', 
      'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhat', 
      'somewhere', 'soon', 'sorry', 'specified', 'specify', 'specifying', 
      'still', 'sub', 'such', 'sup', 'sure', 't', 'take', 'taken', 'tell', 'tends', 
      'th', 'than', 'thank', 'thanks', 'thanx', 'that', 'thats', 'the', 'their', 
      'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 
      'thereby', 'therefore', 'therein', 'theres', 'thereupon', 'these', 
      'they', 'think', 'third', 'this', 'thorough', 'thoroughly', 'those', 
      'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 
      'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 
      'twice', 'two', 'u', 'un', 'under', 'unfortunately', 'unless', 'unlikely', 
      'until', 'unto', 'up', 'upon', 'us', 'use', 'used', 'useful', 'uses', 'using', 
      'usually', 'uucp', 'v', 'value', 'various', 'very', 'via', 'viz', 'vs', 'w', 
      'want', 'wants', 'was', 'way', 'we', 'welcome', 'well', 'went', 'were', 'what', 
      'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 
      'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 
      'which', 'while', 'whilst', 'whither', 'who', 'whoever', 'whole', 
      'whom', 'whose', 'why', 'will', 'willing', 'wish', 'with', 'within', 
      'without', 'wonder', 'would', 'x', 'y', 'yes', 'yet', 'you', 'your', 'yours',
      'yourself', 'yourselves', 'z', 'zero']
#removing small words from English list as Latin list overlaps on small words
#build a new list instead of calling remove() while iterating, which skips the word after each removal
english_sw = [s for s in english_sw if len(s) > 3]
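Filtering a list by word length is easiest to get right by building a new list. Removing items from a list while looping over that same list skips the element that follows each removal, so some short words survive. A small sketch with a hypothetical six-word list shows the difference:

```python
words = ["a", "able", "an", "and", "another", "at"]  # hypothetical sample

# Pitfall: each remove() shifts the remaining items left, so the loop
# never looks at the item that slides into the current position.
buggy = list(words)
for w in buggy:
    if len(w) <= 3:
        buggy.remove(w)

# Building a new list checks every word exactly once.
filtered = [w for w in words if len(w) > 3]

print(buggy)     # 'and' is still present
print(filtered)  # only words longer than three characters
```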

In [12]:
word_set = set(latin_sw)
true_lang = []

for sent in data['sentence']:
    if word_set.intersection(set(sent.split())):
        true_lang.append('latin')
    else:
        true_lang.append('english')
        
data['true_lang'] = true_lang

In [13]:
for sent in data['sentence'][data['true_lang']=='english']:
    print(sent)

morbi a consequat metus
mauris dignissim tempus condimentum
aenean consectetur egestas sem vitae fringilla
integer gravida dui vel massa posuere cursus imperdiet turpis vulputate
praesent consequat finibus tempus
fusce eget tempor risus
suspendisse potenti
cras ornare sit amet sem a cursus
vestibulum consectetur egestas faucibus
nulla dapibus dolor sit amet purus facilisis lobortis
nullam malesuada maximus blandit
nulla facilisi
praesent volutpat vestibulum metus
donec eu commodo turpis
suspendisse vitae tellus mauris
aenean pretium odio a convallis suscipit
vivamus hendrerit lacus sit amet lacus ultricies lacinia
donec placerat interdum porttitor
integer id rhoncus nisl eget consequat odio
nunc dignissim id tellus lacinia elementum
vivamus molestie tortor id elit pretium dapibus
pellentesque varius erat vitae pretium fringilla
donec faucibus tellus mauris vel euismod augue eleifend efficitur
nulla facilisi
vestibulum blandit ornare luctus
curabitur vel volutpat sem
nulla sit amet elit

The stop-word list alone still misses some Latin sentences, so I will add more words by hand.

In [14]:
append_list = ['a','tempus','sem','vitae','dui','etiam','sed','eget','potenti','faucibus',
                 'sit','blandit','facilisi','metus','eu','amet','placerat','id','ut','vel',
                 'nulla','luctus','erat','donec','etiam']
latin_sw.extend(append_list)

In [15]:
#rerun loop
latin_set = set(latin_sw)
english_set = set(english_sw)
true_lang = []

for sent in data['sentence']:
    if latin_set.intersection(set(sent.split())):
        if english_set.intersection(set(sent.split())):
            true_lang.append('english')
        else:
            true_lang.append('latin')
    else:
        true_lang.append('english')
        
data['true_lang'] = true_lang
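The labeling rule in this cell can also be written as a small function, which makes it easier to re-apply after each change to the word lists. This is a sketch of the same logic, with `guess_language` as a hypothetical name and tiny made-up word sets in place of the full lists:

```python
def guess_language(sentence, latin_words, english_words):
    """Label a sentence 'latin' only if it contains a Latin marker word
    and no English marker word; otherwise label it 'english'."""
    tokens = set(sentence.split())
    if tokens & latin_words and not tokens & english_words:
        return "latin"
    return "english"


# Hypothetical miniature word sets for illustration only.
latin_words = {"sit", "amet", "nec"}
english_words = {"about", "because"}

print(guess_language("nulla sit amet elit", latin_words, english_words))
print(guess_language("she sat because it rained", latin_words, english_words))
print(guess_language("we sit about all day", latin_words, english_words))
```

With a DataFrame, the same function could be applied with `data['sentence'].apply(guess_language, args=(latin_set, english_set))`.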

In [16]:
for sent in data['sentence'][data['true_lang']=='english']:
    print(sent)

she wanted to be rescued but only if it was tuesday and raining
as time wore on simple dog commands turned into full paragraphs explaining why the dog couldn’t do something
the rusty nail stood erect angled at a 45degree angle just waiting for the perfect barefoot to come along
always bring cinnamon buns on a deepsea diving expedition
i was very proud of my nickname throughout high school but today i couldn’t be any different to what my nickname was
no matter how beautiful the sunset it saddened her knowing she was one day older
the japanese yen for commerce is still wellknown
traveling became almost extinct during the pandemic
malls are great places to shop; i can find everything i need under one roof
please tell me you dont work in a morgue
the bread dough reminded her of santa clause’s belly
the father died during childbirth
he had decided to accept his fate of accepting his fate
he wondered why at 18 he was old enough to go to war but not old enough to buy cigarettes
i used to 

In [17]:
for sent in data['sentence'][data['true_lang']=='latin']:
    print(sent)

donec tincidunt mollis est nec dapibus
morbi a consequat metus
in nec cursus urna
mauris dignissim tempus condimentum
aenean consectetur egestas sem vitae fringilla
morbi ultrices lectus et augue tincidunt eu blandit purus laoreet
curabitur aliquet aliquam leo non ultricies
ut odio elit maximus sit amet gravida a porttitor et quam.sed nulla mauris mollis a sem ac tincidunt suscipit diam
morbi et massa sit amet ligula dictum placerat
praesent dictum velit in magna iaculis egestas
phasellus cursus a purus ut pretium
integer mattis convallis mi eget maximus nisl ornare nec
nullam et accumsan mauris ut tempor tellus
proin et consectetur nisi et tristique velit
donec ut arcu risus
integer gravida dui vel massa posuere cursus imperdiet turpis vulputate
praesent consequat finibus tempus
integer velit erat congue at convallis eget condimentum sit amet massa.nunc posuere mauris a lorem tincidunt elementum
mauris finibus elementum ipsum sed rhoncus
nullam efficitur augue ut velit posuere imperdi

Okay, so it appears I have corrected all but 7 sentences, most of which contain 'he'. I will add 'he' and 'shakespeare' to my English list so that those sentences are labeled English.

In [18]:
append_list = ['he','shakespeare']
english_sw.extend(append_list)

In [19]:
#rerun loop
latin_set = set(latin_sw)
english_set = set(english_sw)
true_lang = []

for sent in data['sentence']:
    if latin_set.intersection(set(sent.split())):
        if english_set.intersection(set(sent.split())):
            true_lang.append('english')
        else:
            true_lang.append('latin')
    else:
        true_lang.append('english')
        
data['true_lang'] = true_lang
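One way to see how many labels the rule changed is to tally (original, corrected) label pairs. The sketch below uses hypothetical toy labels, not the notebook's data, so the counts it prints say nothing about the real dataset:

```python
from collections import Counter

# Hypothetical labels standing in for data['language'] and data['true_lang'].
original = ["english", "latin", "latin", "english", "latin"]
corrected = ["latin", "latin", "latin", "english", "english"]

# Count each (original, corrected) combination.
pairs = Counter(zip(original, corrected))

# A label was changed wherever the two values differ.
changed = sum(count for (old, new), count in pairs.items() if old != new)

for (old, new), count in sorted(pairs.items()):
    print(f"{old:>8} -> {new:<8} {count}")
print(f"changed labels: {changed}")
```

On the DataFrame itself, `pd.crosstab(data['language'], data['true_lang'])` would give the same tally as a table.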

In [20]:
tv = TfidfVectorizer()

X = tv.fit_transform(data.sentence)
y = data.true_lang

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=160)

lr = LogisticRegression().fit(X_train,y_train)
pred = lr.predict(X_test)
accuracy_score(y_test,pred)

1.0

In [21]:
confusion_matrix(y_test,pred)

array([[38,  0],
       [ 0, 42]])