You can find the data for this exercise in the wikipedia/featured-articles/ directory of the course’s data download.

1. Preparing Text

For this part, start with the L1-penalized logistic regression model you built for categorized-comments.jsonl in the previous exercise. Using this model, compare how different text processing methods affect its accuracy, as computed on the test set.

Each of the following steps is additive, meaning later steps combine all the methods from the prior steps.

Convert all text to lower case letters
Remove all punctuation from the text
Remove stop words
Apply NLTK’s PorterStemmer
Use a Tf-idf vector instead of the word frequency vector
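
One way to read "additive" is that each variant of the text is produced by applying a step on top of the output of all earlier steps. The sketch below illustrates this for the first three steps only, using the standard library. The names `STOP_WORDS`, `STEPS`, and `cumulative_variants` are hypothetical, and the tiny stop-word set stands in for NLTK's English list.

```python
import string

# Hypothetical, deliberately tiny stop-word set; the exercise uses NLTK's English list.
STOP_WORDS = {"the", "is", "a", "of", "and"}

def to_lower(text):
    return text.lower()

def strip_punctuation(text):
    # str.translate with a deletion table removes every punctuation character
    return text.translate(str.maketrans("", "", string.punctuation))

def drop_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

# Order matters: each step receives the output of the previous one.
STEPS = [
    ("lower_case", to_lower),
    ("no_punct", strip_punctuation),
    ("no_stop_words", drop_stop_words),
]

def cumulative_variants(text):
    """Return {step_name: text after that step and every earlier step}."""
    variants = {}
    current = text
    for name, step in STEPS:
        current = step(current)
        variants[name] = current
    return variants

variants = cumulative_variants("The Rocket is READY, and the crew is aboard!")
print(variants["no_stop_words"])
```

Each entry in `variants` can then be vectorized and scored separately, which gives one row per step in the results table.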
In addition to computing the accuracy, compute the receiver operating characteristic area under the curve (ROC AUC). The following code demonstrates how to calculate the ROC AUC assuming you have already trained a model.

calculate roc auc
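
The original snippet is not reproduced in this export, so the following is only a minimal sketch of an ROC AUC calculation with scikit-learn. It assumes a binary target and uses synthetic data from `make_classification` in place of the comment vectors; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the vectorized comments (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101
)

# liblinear is one of the solvers that supports the L1 penalty.
model = LogisticRegression(penalty="l1", solver="liblinear").fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is the positive class.
scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
print(f"ROC AUC: {auc:.3f}")
```

With more than two categories, `roc_auc_score` expects the full probability matrix together with a `multi_class` setting such as `'ovr'`.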

Make sure you use the same test and training sets for training and evaluating all models. Report your results in a table with the following format.

reporting format

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
import pandas as pd
import json
from pandas import json_normalize  # top-level location since pandas 1.0
import numpy as np

In [4]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build the stop-word set once; set membership tests are fast
stop_words = set(stopwords.words('english'))

In [3]:
json_path = 'C:\\Users\\Dan Siegel\\Desktop\\Classes\\550\\data\\reddit\\categorized-comments.jsonl'

In [4]:
data = []
with open(json_path) as f:
    for line in f:
        data.append(json.loads(line))

In [92]:
df = json_normalize(data)  # json_normalize already returns a DataFrame

In [None]:
def text_process(mess):
    # Strip punctuation character by character, then lower-case the text
    nopunc = ''.join(char for char in mess if char not in string.punctuation).lower()
    # Use the precomputed stop_words set instead of reloading the NLTK list for every word
    no_stopwords = [word for word in nopunc.split() if word not in stop_words]
    # Reduce each remaining word to its Porter stem
    stemmer = PorterStemmer()
    return ' '.join(stemmer.stem(word) for word in no_stopwords)

In [86]:
import collections

In [6]:
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer

In [113]:
def text_to_lower(words):
    return words.lower()

def remove_punctuation(words):
    # Iterating over a string yields characters, so this drops punctuation marks
    return ''.join(char for char in words if char not in string.punctuation)

def remove_stop_words(words):
    return ' '.join(w for w in nltk.word_tokenize(words) if w not in stop_words)

def apply_stemming(words):
    # Stem word by word; iterating over the raw string would stem single characters
    stemmer = PorterStemmer()
    return ' '.join(stemmer.stem(word) for word in words.split())

def _calculate_term_frequencies(words):
    # Pass a list of tokens; a raw string would be counted character by character
    word_counter = collections.Counter(words)
    total_words = sum(word_counter.values())
    # TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
    return {term: count / total_words for term, count in word_counter.items()}

In [152]:
df['lower_case'] = df['txt'].apply(text_to_lower)

In [98]:
df['no_punct'] = df['lower_case'].apply(remove_punctuation)

In [104]:
df['no_stop_words'] = df['no_punct'].apply(remove_stop_words)

In [115]:
df['stemmed'] = df['no_stop_words'].apply(apply_stemming)

In [117]:
from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer()

In [118]:
vectorizer = TfidfVectorizer()

In [120]:
# Keep the sparse matrix: scikit-learn accepts it directly and a dense copy can exhaust memory
vectd_tfidf = vectorizer.fit_transform(df.stemmed)

In [125]:
# fit_transform refits the vocabulary on each call; results stay sparse to save memory
lower_case_vec = count_vec.fit_transform(df['lower_case'])
stemmed_vec = count_vec.fit_transform(df['stemmed'])
no_stop_vec = count_vec.fit_transform(df['no_stop_words'])
no_punc_vec = count_vec.fit_transform(df['no_punct'])

In [157]:
mod_results = []
transformations_to_test = [vectd_tfidf, lower_case_vec, stemmed_vec, no_stop_vec, no_punc_vec]
transformation_names = ['vectd_tfidf', 'lower_case_vec', 'stemmed_vec', 'no_stop_vec', 'no_punc_vec']

In [155]:
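for transformation, transformation_name in zip(transformations_to_test, transformation_names):
    # The fixed random_state keeps the train/test split identical for every transformation
    X_train, X_test, y_train, y_test = train_test_split(transformation, df.cat, test_size=0.25, random_state=101)
    # The L1 penalty needs a solver that supports it; saga also handles more than two classes
    classifier = LogisticRegression(penalty='l1', solver='saga')
    mod = classifier.fit(X_train, y_train)
    test_pred = mod.predict(X_test)
    accuracy_test = accuracy_score(y_test, test_pred)
    y_predict = mod.predict_proba(X_test)
    if y_predict.shape[1] == 2:
        # Binary labels: score with the probability of the positive class
        auc = roc_auc_score(y_test, y_predict[:, 1])
    else:
        # More than two categories: one-vs-rest AUC over all class probabilities
        auc = roc_auc_score(y_test, y_predict, multi_class='ovr')
    mod_acc = {'Transformation': transformation_name, 'Accuracy': accuracy_test, 'AUC': auc}
    mod_results.append(mod_acc)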

In [None]:
mod_results

2. Working with Dates and Times

In this part, you will be working with weblogs originally obtained from the NASA Kennedy Space Center web server. You can find a log file from August 1995 in the nasa-http folder of the course’s dataset. Each line of the file represents one request to the server. The following is a description of the data contained within each line.

The hostname or IP address of the computer making the request.
The time of the request. In this log file each timestamp appears as “DD/MON/YYYY:HH:MM:SS -0400” (for example, 01/Aug/1995:00:00:20 -0400), where DD is the day of the month, MON is the abbreviated name of the month, YYYY is the year, and HH:MM:SS is the time of day using a 24-hour clock. The timezone is -0400.
The HTTP request method and the requested URL.
The HTTP status code of the server.
The number of bytes in the reply.
The following is an example of Python code to parse a single entry in the log file. Use this code to create a Pandas data frame with the fields host, timestamp, request_method, request_url, reply_code, and reply_bytes.



In [2]:
import re
from datetime import datetime
from collections import OrderedDict

In [1]:
log_path = 'C:\\Users\\Dan Siegel\\Desktop\\Classes\\550\\data\\nasa-http\\NASA_access_log_Aug95.log'

In [5]:
log_line_regex = re.compile(''.join([
    r'^(?P<host>[\S]+)\s-\s-\s', r'\[(?P<timestamp>.{26})\]',
    r'\s"(?P<request_method>[A-Z]{3,4})\s(?P<request_url>.{1,100})(\sHTTP/1.0)"?',
    r'\s(?P<reply_code>[0-9]{3})\s(?P<reply_bytes>[0-9-]{1,20})$'
]))

In [255]:
line = 'd0ucr6.fnal.gov - - [01/Aug/1995:00:00:20 -0400] "GET /history/apollo/apollo-16/apollo-16-patch-small.gif HTTP/1.0" 200 14897'
m = log_line_regex.match(line)
record = OrderedDict([
    (key, value)
    for key, value in m.groupdict().items()
])
record['timestamp'] = datetime.strptime(
    record['timestamp'],
    '%d/%b/%Y:%H:%M:%S %z'
)

In [256]:
record

OrderedDict([('host', 'd0ucr6.fnal.gov'),
             ('timestamp',
              datetime.datetime(1995, 8, 1, 0, 0, 20, tzinfo=datetime.timezone(datetime.timedelta(-1, 72000)))),
             ('request_method', 'GET'),
             ('request_url',
              '/history/apollo/apollo-16/apollo-16-patch-small.gif'),
             ('reply_code', '200'),
             ('reply_bytes', '14897')])
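
The `tzinfo` in the output above can look odd: `timedelta(-1, 72000)` is simply how Python displays an offset of minus four hours (minus one day plus 72000 seconds). The short sketch below, using only the standard library, checks that reading and converts the sample timestamp to UTC. The variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Same format string as the parsing code above
parsed = datetime.strptime("01/Aug/1995:00:00:20 -0400", "%d/%b/%Y:%H:%M:%S %z")
offset = parsed.utcoffset()

# timedelta normalises negative durations to "minus one day plus 72000 seconds"
same_offset = timedelta(days=-1, seconds=72000)

# Converting to UTC shifts the wall-clock time forward by four hours
as_utc = parsed.astimezone(timezone.utc)

print(offset == timedelta(hours=-4), same_offset == offset, as_utc.isoformat())
```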

In [6]:
nasa_data = []
with open(log_path) as f:
    for line in f:
        m = log_line_regex.match(line)
        if m:
            record = OrderedDict([
                (key, value)
                for key, value in m.groupdict().items()
            ])
            record['timestamp'] = datetime.strptime(record['timestamp'],'%d/%b/%Y:%H:%M:%S %z')
            nasa_data.append(record)

In [262]:
nasa_data[0]

OrderedDict([('host', 'uplherc.upl.com'),
             ('timestamp',
              datetime.datetime(1995, 8, 1, 0, 0, 7, tzinfo=datetime.timezone(datetime.timedelta(-1, 72000)))),
             ('request_method', 'GET'),
             ('request_url', '/'),
             ('reply_code', '304'),
             ('reply_bytes', '0')])

In [7]:
nasa_df = pd.DataFrame.from_records(nasa_data)

In [286]:
nasa_df['timestamp'] = pd.to_datetime(nasa_df['timestamp'])

In [35]:
nasa_df['date'] = nasa_df['timestamp'].apply(lambda x: x.date())  # call date(); x.date alone stores the method

In [9]:
times = pd.DatetimeIndex(nasa_df.timestamp)

In [10]:
nasa_df['weekday'] = times.day_name()  # weekday_name was removed in pandas 1.0

In [59]:
weekdays = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

In [72]:
weekday_aggregates = []

In [75]:
for weekday in weekdays:
    # Requests per calendar date for this weekday, computed once
    daily_counts = nasa_df[nasa_df['weekday'] == weekday].date.value_counts()
    totals = {'Weekday': weekday, 'Requests(Max)': daily_counts.max(),
              'Requests(Min)': daily_counts.min(), 'Requests(Mean)': daily_counts.mean()}
    weekday_aggregates.append(totals)

In [76]:
weekday_aggregated_df = pd.DataFrame.from_dict(weekday_aggregates)

In [77]:
weekday_aggregated_df

,Requests(Max),Requests(Mean),Requests(Min),Weekday
0,36435,33625.5,32373,Sunday
1,59805,56984.0,55381,Monday
2,67855,55643.8,33932,Tuesday
3,80222,63805.75,56594,Wednesday
4,89731,60723.6,41333,Thursday
5,61165,58533.5,56197,Friday
6,37990,33356.0,31576,Saturday
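
The same per-weekday summary can also be produced without an explicit loop. The following is a sketch of one possible `groupby` formulation, not the method used above. It runs on a four-row stand-in frame (`requests` is a hypothetical name) so that the result can be checked by hand.

```python
import pandas as pd

# Tiny stand-in for nasa_df: 1995-08-01 and 1995-08-08 are Tuesdays, 1995-08-02 a Wednesday
requests = pd.DataFrame({"timestamp": pd.to_datetime([
    "1995-08-01 00:00:07", "1995-08-01 10:00:00",
    "1995-08-02 09:00:00", "1995-08-08 12:00:00",
])})

# Step 1: number of requests on each calendar date (dates with no requests are absent)
daily = requests.groupby(requests["timestamp"].dt.normalize()).size()

# Step 2: group those daily totals by weekday name and summarise them
summary = daily.groupby(daily.index.day_name()).agg(["min", "mean", "max"])
print(summary)
```

Here Tuesday has daily totals of 2 and 1, so its row reads min 1, mean 1.5, max 2.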


In [78]:
nasa_df['time'] = nasa_df['timestamp'].apply(lambda x: x.timetz())  # call the method to keep the time of day with its offset

In [87]:
# Match on the clock time; formatting the timestamp avoids building a time object by hand
nasa_df[nasa_df['timestamp'].dt.strftime('%H:%M:%S') == '00:00:07']

In [104]:
time_column = nasa_df[['timestamp', 'time']]

In [107]:
time_column.between_time('0:15', '0:45')

TypeError: Index must be DatetimeIndex
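
The error arises because `between_time` filters on the index, not on a column, so the frame must be indexed by its timestamps first. A minimal sketch on a three-row stand-in frame (the names `frame`, `indexed`, and `window` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "1995-08-01 00:10:00", "1995-08-01 00:30:00", "1995-08-01 01:00:00",
    ]),
    "reply_bytes": [10, 20, 30],
})

# between_time needs a DatetimeIndex, so move the timestamps into the index
indexed = frame.set_index("timestamp")

# Keep rows whose time of day falls between 00:15 and 00:45 (both ends inclusive)
window = indexed.between_time("0:15", "0:45")
print(window)
```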

In [109]:
nasa_df['hour'] = nasa_df['timestamp'].dt.hour

In [117]:
nasa_df.index= pd.to_datetime(nasa_df.timestamp)

In [141]:
nasa_df.between_time(start_time='00:00:00', end_time='03:00:00')

In [120]:
nasa_df

timestamp,host,timestamp,request_method,request_url,reply_code,reply_bytes,weekday,date,time,hour
1995-08-01 00:00:07-04:00,uplherc.upl.com,1995-08-01 00:00:07-04:00,GET,/,304,0,Tuesday,1995-08-01,00:00:07-04:00,0
1995-08-01 00:00:08-04:00,uplherc.upl.com,1995-08-01 00:00:08-04:00,GET,/images/ksclogo-medium.gif,304,0,Tuesday,1995-08-01,00:00:08-04:00,0
1995-08-01 00:00:08-04:00,uplherc.upl.com,1995-08-01 00:00:08-04:00,GET,/images/MOSAIC-logosmall.gif,304,0,Tuesday,1995-08-01,00:00:08-04:00,0
1995-08-01 00:00:08-04:00,uplherc.upl.com,1995-08-01 00:00:08-04:00,GET,/images/USA-logosmall.gif,304,0,Tuesday,1995-08-01,00:00:08-04:00,0
1995-08-01 00:00:09-04:00,ix-esc-ca2-07.ix.netcom.com,1995-08-01 00:00:09-04:00,GET,/images/launch-logo.gif,200,1713,Tuesday,1995-08-01,00:00:09-04:00,0
1995-08-01 00:00:10-04:00,uplherc.upl.com,1995-08-01 00:00:10-04:00,GET,/images/WORLD-logosmall.gif,304,0,Tuesday,1995-08-01,00:00:10-04:00,0
1995-08-01 00:00:10-04:00,slppp6.intermind.net,1995-08-01 00:00:10-04:00,GET,/history/skylab/skylab.html,200,1687,Tuesday,1995-08-01,00:00:10-04:00,0
1995-08-01 00:00:10-04:00,piweba4y.prodigy.com,1995-08-01 00:00:10-04:00,GET,/images/launchmedium.gif,200,11853,Tuesday,1995-08-01,00:00:10-04:00,0
1995-08-01 00:00:11-04:00,slppp6.intermind.net,1995-08-01 00:00:11-04:00,GET,/history/skylab/skylab-small.gif,200,9202,Tuesday,1995-08-01,00:00:11-04:00,0
1995-08-01 00:00:12-04:00,slppp6.intermind.net,1995-08-01 00:00:12-04:00,GET,/images/ksclogosmall.gif,200,3635,Tuesday,1995-08-01,00:00:12-04:00,0


In [137]:
date_vals = nasa_df.date.unique()

In [155]:
nasa_df[nasa_df['date'] ==date_vals[0]].between_time(start_time='00:00:00', end_time='03:00:00').date.value_counts().min()

3983

In [151]:
times = [['00:00:00', '03:00:00'], ['03:00:01', '06:00:00'],['06:00:01', '09:00:00'],['09:00:01', '12:00:00'], 
         ['12:00:01', '15:00:00'], ['15:00:01', '18:00:00'], ['18:00:01', '21:00:00'], ['21:00:01', '23:59:00']]

In [177]:
list_of_counts = []
for date in date_vals:
    for start,finish in times:
        count = nasa_df[nasa_df['date'] ==date].between_time(start_time=start, end_time=finish).date.value_counts().min()
        dict_of_count = {'Start': start, 'Finish': finish, 'Date': date, 'Counter': count}
        list_of_counts.append(dict_of_count)
time_segmentations = pd.DataFrame.from_records(list_of_counts)
time_segmentations = time_segmentations.dropna()

In [234]:
req_by_day = []
for start, finish in times:
    # Per-date request counts for this time window, selected once
    window_counts = time_segmentations[time_segmentations['Finish'] == finish].Counter
    time_of_day = start + ' ' + finish
    dic_of_reqs_by_day = {'Time of Day': time_of_day, 'Requests(Min)': window_counts.min(),
                          'Requests(Mean)': window_counts.mean(), 'Requests(Max)': window_counts.max()}
    req_by_day.append(dic_of_reqs_by_day)
req_by_day_df = pd.DataFrame.from_records(req_by_day)

In [191]:
req_by_day_df

,Requests(Max),Requests(Mean),Requests(Min),Time of Day
0,6787.0,4086.0,3331.0,00:00:00 03:00:00
1,6127.0,2804.166667,59.0,03:00:01 06:00:00
2,11997.0,4798.366667,242.0,06:00:01 09:00:00
3,18132.0,8730.4,2520.0,09:00:01 12:00:00
4,17038.0,10353.1,4525.0,12:00:01 15:00:00
5,15495.0,9980.793103,4730.0,15:00:01 18:00:00
6,9470.0,6401.655172,4208.0,18:00:01 21:00:00
7,10304.0,5933.275862,4653.0,21:00:01 23:59:00
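
As an alternative to listing start and end times by hand, the hour of each request can be mapped to a three-hour block with integer division. This is a sketch of that idea on stand-in data, not the computation used above: its bins are half-open (00:00:00 to 02:59:59, 03:00:00 to 05:59:59, and so on), so boundary rows land differently than with the inclusive `between_time` windows. All names are illustrative.

```python
import pandas as pd

requests = pd.DataFrame({"timestamp": pd.to_datetime([
    "1995-08-01 00:10:00", "1995-08-01 02:59:59", "1995-08-01 03:00:00",
    "1995-08-02 01:00:00", "1995-08-02 23:30:00",
])})

# Block 0 covers hours 0-2, block 1 covers hours 3-5, ... block 7 covers hours 21-23
block = (requests["timestamp"].dt.hour // 3).rename("block")
day = requests["timestamp"].dt.normalize().rename("day")

# Requests per (day, block); combinations with no requests are simply absent
per_day_block = requests.groupby([day, block]).size()

# Min, mean and max of the daily counts for each block
summary = per_day_block.groupby(level="block").agg(["min", "mean", "max"])
print(summary)
```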


In [197]:
nasa_df[nasa_df['date'] ==date_vals[0]].between_time(start_time='00:00:00', end_time='03:00:00').reply_bytes.value_counts().max()

313


In [258]:
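# reply_bytes is parsed as text ('-' when no size was logged); convert it before comparing
nasa_df['reply_bytes'] = pd.to_numeric(nasa_df['reply_bytes'], errors='coerce')
get_the_bytes = nasa_df[nasa_df['reply_bytes'] > 0]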

In [259]:
list_of_bytes = []
for start, finish in times:
    # Select the window once, then summarise the reply sizes inside it
    window_bytes = get_the_bytes.between_time(start_time=start, end_time=finish).reply_bytes
    list_of_bytes.append({
        'Time of Day': start + ' ' + finish,
        'Bytes(Mean)': window_bytes.mean(),
        'Bytes(Max)': window_bytes.max(),
        'Bytes(Min)': window_bytes.min(),
    })
byte_segmentations = pd.DataFrame.from_records(list_of_bytes)

In [260]:
byte_segmentations

,Bytes(Max),Bytes(Mean),Bytes(Min),Time of Day
0,1269716.0,20757.007784,68.0,00:00:00 03:00:00
1,3155499.0,20391.255477,68.0,03:00:01 06:00:00
2,3155499.0,17849.718965,68.0,06:00:01 09:00:00
3,3155499.0,18092.650575,28.0,09:00:01 12:00:00
4,3155499.0,18550.047398,50.0,12:00:01 15:00:00
5,3421948.0,18662.386597,68.0,15:00:01 18:00:00
6,3155499.0,20284.516418,68.0,18:00:01 21:00:00
7,1925120.0,19109.948388,66.0,21:00:01 23:59:00
