# Assignment 1

In this assignment we worked with messy medical data and used regular expressions to extract the relevant information.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats. The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates according to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where the year is encoded in only two digits are years from the 1900s (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.

With these rules in mind, find the correct date in each note and return a pandas Series containing the original Series' indices in chronological order. Ties are broken by sorting on ("extracted date", "original row number").
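
The numeric rules and the tie-break sort can be sketched in a few lines. This is a minimal illustration rather than the notebook's solution: `normalize_numeric_date` and `sort_rows_by_date` are hypothetical helper names, only the all-numeric formats (mm/dd/yy, mm/dd/yyyy, mm/yyyy, yyyy) are handled, and every input string is assumed to parse.

```python
import re
from datetime import date


# Hypothetical helper: covers only the all-numeric formats named in the rules.
def normalize_numeric_date(text):
    """Return a datetime.date for mm/dd/yy(yy), mm/yyyy or yyyy, else None."""
    text = text.strip()
    m = re.fullmatch(r'(\d{1,2})[/-](\d{1,2})[/-](\d{2}|\d{4})', text)
    if m:
        month, day, year = (int(g) for g in m.groups())
        if year < 100:      # two-digit years are assumed to be 19xx
            year += 1900
        return date(year, month, day)
    m = re.fullmatch(r'(\d{1,2})/(\d{4})', text)
    if m:                   # missing day: use the first of the month
        return date(int(m.group(2)), int(m.group(1)), 1)
    m = re.fullmatch(r'\d{4}', text)
    if m:                   # missing month: use January 1st
        return date(int(text), 1, 1)
    return None


# Hypothetical helper: assumes every string in raw_dates parses to a date.
def sort_rows_by_date(raw_dates):
    """Return original row numbers ordered by (extracted date, row number)."""
    parsed = [(normalize_numeric_date(s), row) for row, s in enumerate(raw_dates)]
    return [row for _, row in sorted(parsed)]


order = sort_rows_by_date(['9/2009', '1/5/89', '2010', '09/01/2009'])
print(order)  # [1, 0, 3, 2]
```

Rows 0 and 3 both resolve to September 1, 2009, so the original row number decides their order.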

In [1]:
import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df.head(10)

0         03/25/93 Total time of visit (in minutes):\n
1                       6/18/85 Primary Care Doctor:\n
2    sshe plans to move as of 7/8/71 In-Home Servic...
3                7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level (N...
5                    .Per 7/06/79 Movement D/O note:\n
6    4, 5/18/78 Patient's thoughts about current su...
7    10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                         3/7/86 SOS-10 Total Score:\n
9             (4/10/71)Score-1Audit C Score Current:\n
dtype: object

In [2]:
def date_sorter():
    # Define the regular expression pattern to extract dates
    import re
    pattern = r'''
            (\d{1,2}[/-]\d{1,2}[/-]\d{2,4}         # Match date format 04/20/2009; 04/20/09; 4/20/09; 4/3/09
            |                                      # OR
            (?:\d{1,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[\s.,-]*(?:\d{1,2}[a-z]*[\s,-]*)?\d{4} # Match date format Mar-20-2009; 20 Mar 2009; Mar 20th, 2009; Feb 2009 etc.
            |                                      # OR
            \d{1,2}[/]\d{2,4}                      # Match date format MM/YYYY or MM/YY
            |                                      # OR
            (?<!-)\d{4})                           # Match date format YYYY
        '''
    # Extract dates from the text using the regular expression
    date = df.str.extract(pattern, re.VERBOSE)
    # Replace two-digit years with corresponding 1900s years
    date = date.replace(r'/(\d{2})$', r'/19\1', regex=True)
    # Fix misspelled month names
    date = date.replace(r'Janaury', r'January', regex=True)
    date = date.replace(r'Decemeber', r'December', regex=True)

    # Convert the dates to datetime format for sorting
    dates = pd.to_datetime(date[0], errors='coerce')

    # Create a DataFrame with dates and the original index
    df_dates = pd.DataFrame({'date': dates, 'index': df.index})
    # Sort the DataFrame first by date and then by index in ascending order
    sorted_df = df_dates.sort_values(by=['date', 'index'], ascending=[True, True]).reset_index(drop=True)
    
    return sorted_df 

date_sorter()

Unnamed: 0,date,index
0,1971-04-10,9
1,1971-05-18,84
2,1971-07-08,2
3,1971-07-11,53
4,1971-09-12,28
...,...,...
495,2016-05-01,427
496,2016-05-30,141
497,2016-10-13,186
498,2016-10-19,161


# Assignment 2 - Introduction to NLTK

In this assignment we created a spelling recommender function that uses nltk to find words similar to a misspelling. For every misspelled word, the recommender finds the word in `correct_spellings` that starts with the same letter as the misspelled word and has the shortest distance to it, and returns that word as a recommendation. The version below measures distance as the Jaccard distance between the sets of character trigrams of the two words.
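
To make the distance concrete, here is a small pure-Python sketch of Jaccard distance on character trigrams. The helper names (`char_ngrams`, `jaccard`, `recommend`) and the toy word list are assumptions for illustration; the notebook itself relies on nltk's `jaccard_distance` and `ngrams`, and returning 0.0 for two empty sets is a choice made only in this sketch.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams contained in `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:           # sketch-only convention for two empty sets
        return 0.0
    return 1 - len(a & b) / len(union)


def recommend(misspelled, candidates, n=3):
    """Pick the candidate with the same first letter and the smallest distance."""
    same_letter = [w for w in candidates if w.startswith(misspelled[0])]
    target = char_ngrams(misspelled, n)
    return min(same_letter, key=lambda w: jaccard(target, char_ngrams(w, n)))


# Hypothetical toy vocabulary, standing in for the nltk word list.
toy_words = ['validate', 'valid', 'invalidate', 'vault']
print(recommend('validrate', toy_words))  # validate
```

Here 'validrate' and 'validate' share 4 of 9 distinct trigrams (distance 5/9), which narrowly beats 'valid' with 3 of 7 (distance 4/7).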


In [5]:
import nltk
from nltk.corpus import words
nltk.download('words')

correct_spellings = words.words()

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [6]:
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
    from nltk.metrics.distance import jaccard_distance
    from nltk.util import ngrams

    outcomes = []
    for entry in entries:
        # Only consider dictionary words that start with the same letter
        spellings = [s for s in correct_spellings if s.startswith(entry[0])]
        # Pair each candidate with its Jaccard distance on character trigrams
        distances = ((jaccard_distance(set(ngrams(entry, 3)),
                                       set(ngrams(word, 3))), word) for word in spellings)
        # min() compares tuples, so the smallest distance wins
        closest = min(distances)
        outcomes.append(closest[1])
    return outcomes
    
answer_nine()

['corpulent', 'indecence', 'validate']

# Assignment 3

In this assignment we explored text message data and created models to predict if a message is spam or not. The task was the following:

Fit and transform the **first 2000 rows** of training data X_train using a Count Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **character n-grams from n=2 to n=5.** To tell Count Vectorizer to use character n-grams pass in `analyzer='char_wb'` which creates character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes.
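
The idea behind word-bounded character n-grams can be sketched in plain Python. This is a simplified approximation, not scikit-learn's analyzer: `char_wb_ngrams` is a hypothetical helper that pads each word with one space on either side and slides a window over it, and it skips details such as lowercasing and the handling of words shorter than n.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Character n-grams taken from each space-padded word separately."""
    grams = []
    for word in text.split():
        padded = ' ' + word + ' '
        for n in range(min_n, max_n + 1):
            if n > len(padded):     # simplification: skip sizes that do not fit
                break
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams


print(char_wb_ngrams('hi', 2, 2))  # [' h', 'hi', 'i ']
print(char_wb_ngrams('free txt', 2, 3))
```

Because every window stays inside one padded word, no n-gram spans two words, and a single typo only disturbs the few n-grams that overlap it.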

Using this document-term matrix and the following additional features:
* the length of document (number of characters)
* number of digits per document
* **number of non-word characters (anything other than a letter, digit or underscore.)**

fit a Logistic Regression model with regularization C=100 and max_iter=1000. Then compute the area under the curve (AUC) score using the transformed test data.
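
As a quick check of what the three additional features measure, the sketch below computes them for a single made-up message. The `extra_features` helper and the sample text are assumptions for illustration only.

```python
import re


# Hypothetical helper computing the three extra features for one document.
def extra_features(doc):
    return {
        'length_of_doc': len(doc),                           # number of characters
        'digit_count': len(re.findall(r'\d', doc)),          # characters 0-9
        'non_word_char_count': len(re.findall(r'\W', doc)),  # not letter, digit or underscore
    }


features = extra_features('Call 911 now!')
print(features)
# {'length_of_doc': 13, 'digit_count': 3, 'non_word_char_count': 3}
```

Note that `\W` also counts whitespace, so the two spaces and the exclamation mark make up the three non-word characters.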

Also **find the 10 smallest and 10 largest coefficients from the model** and return them along with the AUC score in a tuple. The list of 10 smallest coefficients should be sorted smallest first, and the list of 10 largest coefficients should be sorted largest first. If any of the three features added to the document-term matrix appear in these lists, they should carry the following names:
['length_of_doc', 'digit_count', 'non_word_char_count']
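
The "smallest first" and "largest first" orderings both fall out of one ascending sort of the coefficient indices. The sketch below uses plain lists with made-up coefficients and feature names (all hypothetical) and picks the top 2 instead of the top 10; the same slicing works on the result of a NumPy `argsort`.

```python
# Hypothetical toy coefficients and feature names, for illustration only.
names = ['a', 'b', 'c', 'd', 'e']
coefs = [0.4, -1.2, 2.5, 0.0, -0.3]

# Indices that sort the coefficients in ascending order.
order = sorted(range(len(coefs)), key=lambda i: coefs[i])

k = 2
smallest = [names[i] for i in order[:k]]            # first k, smallest first
largest = [names[i] for i in order[:-(k + 1):-1]]   # last k, read backwards

print(smallest, largest)  # ['b', 'e'] ['c', 'a']
```

With k = 10 the second slice becomes `[:-11:-1]`, which walks back from the end of the sorted indices and stops after ten entries.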


In [10]:
import pandas as pd
import numpy as np

spam_data = pd.read_csv('spam.csv')

spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head(10)

Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
5,FreeMsg Hey there darling it's been 3 week's n...,1
6,Even my brother is not like to speak with me. ...,0
7,As per your request 'Melle Melle (Oru Minnamin...,0
8,WINNER!! As a valued network customer you have...,1
9,Had your mobile 11 months or more? U R entitle...,1


In [11]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    random_state=0)

In [14]:
import re
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score

def answer_eleven():
    
    vect = CountVectorizer(min_df=5, ngram_range=(2, 5), analyzer='char_wb').fit(X_train[:2000])
    X_train_vectorized = vect.transform(X_train[:2000]) 
    
    # Calculate the length of documents, number of digits, and non-word characters, and add them as features
    length_of_doc = [len(doc) for doc in X_train[:2000]]
    digit_count = [len(re.findall(r'\d', doc)) for doc in X_train[:2000]]
    non_word_char_count = [len(re.findall(r'\W', doc)) for doc in X_train[:2000]]
    X_train_vectorized = add_feature(X_train_vectorized, [length_of_doc,digit_count,non_word_char_count])

    # Fit the Logistic Regression model
    model = LogisticRegression(C=100, max_iter=1000)
    model.fit(X_train_vectorized, y_train[:2000])

    # Transform the test data and calculate AUC score
    X_test_vectorized = vect.transform(X_test)
    length_of_doc = [len(doc) for doc in X_test]
    digit_count = [len(re.findall(r'\d', doc)) for doc in X_test]
    non_word_char_count = [len(re.findall(r'\W', doc)) for doc in X_test]    
    X_test_vectorized = add_feature(X_test_vectorized, [length_of_doc,digit_count,non_word_char_count])

    probabilities = model.predict_proba(X_test_vectorized)[:, 1]
    
    # Combine CountVectorizer feature names with custom feature names
    count_vectorizer_feature_names = np.array(vect.get_feature_names_out())
    custom_feature_names = ['length_of_doc', 'digit_count', 'non_word_char_count']
    feature_names = np.concatenate((count_vectorizer_feature_names, custom_feature_names))
    
    # Find the 10 smallest and 10 largest coefficients from the model
    sorted_coef_index = model.coef_[0].argsort()
    smallest_coefs = list(feature_names[sorted_coef_index[:10]])
    largest_coefs = list(feature_names[sorted_coef_index[:-11:-1]])
    
    return roc_auc_score(y_test, probabilities), smallest_coefs, largest_coefs

def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

answer_eleven()

(0.9975680355839261,
 ['n ', ' i', 'at', 'he', ' m', '..', 'us', 'go', ' lo', ' bu'],
 ['digit_count', 'ne', ' st', 'co', 's ', 'xt', 'lt', 'xt ', ' ne', 'der'])

# Assignment 4 - Document Similarity & Topic Modelling

In this assignment we used Gensim's LDA (Latent Dirichlet Allocation) model to discover topics in a collection of newsgroup documents.

In [15]:
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open('newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)

# Use CountVectorizer to find tokens of at least three characters, remove stop words,
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
# Fit and transform
X = vect.fit_transform(newsgroup_data)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

In [16]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate LDA model parameters on the corpus
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10, id2word = id_map, passes = 25, random_state = 34)

In [17]:
def lda_topics():
    """
    Return the 10 topics, each with its 10 most significant words.
    """
    return ldamodel.print_topics(num_topics=10, num_words=10)

lda_topics()

[(0,
  '0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know" + 0.020*"does" + 0.014*"info" + 0.012*"monitor" + 0.010*"looking" + 0.010*"don"'),
 (1,
  '0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use" + 0.011*"using" + 0.011*"used" + 0.010*"power" + 0.010*"speed" + 0.010*"output"'),
 (2,
  '0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard" + 0.028*"controller" + 0.027*"card" + 0.020*"rom" + 0.018*"floppy" + 0.017*"bus"'),
 (3,
  '0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt" + 0.012*"faq" + 0.012*"probably" + 0.011*"know" + 0.011*"send" + 0.010*"months"'),
 (4,
  '0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good" + 0.011*"new" + 0.011*"think" + 0.010*"year" + 0.010*"cars" + 0.010*"time"'),
 (5,
  '0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play" + 0.012*"season" + 0.012*"players" + 0.012*"win" + 0.011*"hockey" + 0.011*"good"'),
 (6,
  '0.0

We also found the topic distribution for the new document `new_doc`.

In [18]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]

In [19]:
def topic_distribution():
    
    new_doc_transformed = vect.transform(new_doc)
    corpus = gensim.matutils.Sparse2Corpus(new_doc_transformed, documents_columns=False)
    doc_topics = ldamodel.get_document_topics(corpus)
    topic_dist = []
    for val in list(doc_topics):
        for v in val:
            topic_dist.append(v)
    return topic_dist

topic_distribution()

[(0, 0.020003106),
 (1, 0.020003323),
 (2, 0.02000128),
 (3, 0.4967471),
 (4, 0.020004036),
 (5, 0.020004127),
 (6, 0.02000297),
 (7, 0.020002643),
 (8, 0.020003127),
 (9, 0.34322825)]

The most relevant topics are topic 3 (weight of about 0.50), whose theme is not clear from its top words, and topic 9 (about 0.34), which is related to science. The remaining eight topics each carry a weight of about 0.02.
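
Ranking the topics can also be done programmatically instead of by eye. The sketch below copies the distribution printed above into a list and sorts it by probability; the variable names are illustrative.

```python
# Topic distribution copied from the output of topic_distribution().
topic_dist = [
    (0, 0.020003106), (1, 0.020003323), (2, 0.02000128), (3, 0.4967471),
    (4, 0.020004036), (5, 0.020004127), (6, 0.02000297), (7, 0.020002643),
    (8, 0.020003127), (9, 0.34322825),
]

# Sort by probability, highest first, and keep the two dominant topics.
ranked = sorted(topic_dist, key=lambda pair: pair[1], reverse=True)
top_two = ranked[:2]
share = sum(prob for _, prob in top_two)

print(top_two)          # [(3, 0.4967471), (9, 0.34322825)]
print(round(share, 2))  # 0.84
```

Together, topics 3 and 9 account for roughly 84% of the probability mass assigned to the document.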