Data Description: The 4 attached json files (sub_class0.json, sub_class1.json, sub_class2.json, sub_class3.json),
represent four separate classes of top 100 submissions from subreddits with the following designations:
sub_class0.json is taken from r/basketball; sub_class1.json is taken from r/machinelearning; sub_class2.json is
taken from r/personalfinance; and sub_class3.json is taken from r/Louisville. Each individual submission has up
to 5 characteristics stored as key-value pairs, with the keys being Title, id, Score, Num_Comments, and Date*.
* Many submissions are missing one or more of these characteristics; see the instructions below.
Instructions (Five parts in total):


**Part 1.** For each json file, you are to take the following steps:
* a. Load the json file into a Python data-frame (read_json is recommended, though you are welcome to use
whatever approach you prefer), and perform the following:
* b. discard any rows that lack ANY of the 5 characteristics (Title, id, Score, Num_Comments, Date). Note that
such rows will be missing the characteristics entirely (see the third submission within sub_class1.json as an
example). Rows that include empty strings, 0 values, etc. should not be discarded for possessing them.
* c. after doing so, discard the id characteristic, as it will not be used with future tasks.
* d. Save this modified data-frame to a json file with the name sub_class#_P1.json, where # is the original json file
id (ex: after modification, the data from sub_class0.json will be written to file sub_class0_P1.json, etc.) It is
highly recommended that you use the data-frame to_json method for this task. If you do not, you must mention
in your README or jupyter cells what alternative was used (so that we can properly read in your json file).


In [None]:
import pandas as pd

# Load json files
class0 = pd.read_json('sub_class0.json')
class0.Name = "sub_class0"
class1 = pd.read_json('sub_class1.json')
class1.Name = "sub_class1"
class2 = pd.read_json('sub_class2.json')
class2.Name = "sub_class2"
class3 = pd.read_json('sub_class3.json')
class3.Name = "sub_class3"

classes = [class0, class1, class2, class3]

# Discard rows with missing columns
for sub_class in classes:
    sub_class.dropna(inplace=True)

# Delete id column
for sub_class in classes:
    del sub_class['id']

# re-export to json
for sub_class in classes:
    print(sub_class)
    sub_class.to_json(sub_class.Name + '_P1.json', orient='records')
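
As a quick check of step b, the sketch below uses a small in-memory JSON sample (hypothetical records, not taken from the actual files) to show that a record lacking keys entirely is dropped, while a record holding an empty string and zeros is kept. Listing the five columns in `subset` is an assumption that makes the intent explicit; it is not required by the cell above.

```python
import io
import pandas as pd

# Hypothetical records: the second one lacks Score and Date entirely.
raw = io.StringIO('''[
  {"Title": "", "id": "a1", "Score": 0, "Num_Comments": 0, "Date": 1577836800},
  {"Title": "Missing fields", "id": "b2", "Num_Comments": 3},
  {"Title": "Complete", "id": "c3", "Score": 12, "Num_Comments": 4, "Date": 1609459200}
]''')

required = ["Title", "id", "Score", "Num_Comments", "Date"]
frame = pd.read_json(raw)

# Missing keys load as NaN/NaT and are dropped; "" and 0 are real values and stay.
cleaned = frame.dropna(subset=required).drop(columns="id")
print(cleaned)
```

Only the record with missing keys should disappear here; the row with an empty title and zero score survives.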



**Part 2.** For each modified data-frame (i.e. after the transformation from part 1) you are to take the following
steps:
* a. use textblob to calculate the sentiment polarity of each submission’s title (ignore the subjectivity, and don’t
use the NaiveBayesAnalyzer option). You may want to store these in a structure (ex: list) to make step b simple.
* b. Add a new column to your modified data-frame called “Sentiment,” which includes the value computed in
step a above for each entry/row.
* c. You are to then eliminate every row from your modified data-frame with “neutral” sentiment, defined to be
any sentiment that falls between -0.1 and 0.1, inclusive (you must eliminate rows with 0.1 or -0.1 sentiment).
* d. Save this modified data-frame to a json file with the name sub_class#_P2.json, where # is the original json file
id (ex: after modification, the data from sub_class0.json will be written to file sub_class0_P2.json, etc.)


In [68]:
from textblob import TextBlob

# Load json files
class0 = pd.read_json('sub_class0_P1.json')
class0.Name = "sub_class0"
class1 = pd.read_json('sub_class1_P1.json')
class1.Name = "sub_class1"
class2 = pd.read_json('sub_class2_P1.json')
class2.Name = "sub_class2"
class3 = pd.read_json('sub_class3_P1.json')
class3.Name = "sub_class3"

classes = [class0, class1, class2, class3]

# Add polarity
for sub_class in classes:
    sub_class.insert(2, 'Sentiment', 0.0) # part b
    for index, row in sub_class.iterrows(): # add sentiment polarity
        blob = TextBlob(row['Title'])
        sub_class.at[index, 'Sentiment'] = blob.sentiment.polarity # part a
        # print(row['Title'], blob.sentiment.polarity)
    for index, row in sub_class.iterrows(): # remove rows with neutral sentiment
        if row['Sentiment'] >= -0.1 and row['Sentiment'] <= 0.1:
            sub_class.drop(index, inplace=True) # part c
    
# re-export to json
for sub_class in classes: # part d
    sub_class.to_json(sub_class.Name + '_P2.json', orient='records')
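
An alternative to dropping rows inside an `iterrows` loop is a boolean mask. The sketch below uses made-up polarity values (not real TextBlob output) to show that `Series.between`, which is inclusive on both ends by default, treats exactly -0.1 and 0.1 as neutral.

```python
import pandas as pd

# Hypothetical polarity values, including both boundary cases.
frame = pd.DataFrame({
    "Title": ["t1", "t2", "t3", "t4", "t5"],
    "Sentiment": [-0.5, -0.1, 0.0, 0.1, 0.25],
})

# between() is inclusive on both ends by default, so -0.1 and 0.1 count as neutral.
neutral = frame["Sentiment"].between(-0.1, 0.1)
kept = frame[~neutral].reset_index(drop=True)
print(kept)
```

Filtering with a mask avoids modifying the data-frame while it is being iterated over.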



**Part 3.** Using the data from step 2 (make sure you compute sentiment and exclude “neutral” submissions!),
create a simple bar plot that depicts the following quantities related to sentiment in order (8 bars in total):
* i. # of class0 submissions with positive sentiment, 
* ii. # of class0 submissions with negative sentiment
* iii. # of class1 submissions with positive sentiment, 
* iv. # of class1 submissions with negative sentiment
* v. # of class2 submissions with positive sentiment, 
* vi. # of class2 submissions with negative sentiment
* vii. # of class3 submissions with positive sentiment, 
* viii. # of class3 submissions with negative sentiment
You are free to use whatever presentation style you want for the bar plots, but make sure you include a legend or
x-axis labels that clearly indicate which quantity each bar represents – and you must use the bar order above (left
to right or top to bottom).


In [None]:
import matplotlib.pyplot as plt

# Load json files
class0 = pd.read_json('sub_class0_P2.json')
class0.Name = "sub_class0"
class1 = pd.read_json('sub_class1_P2.json')
class1.Name = "sub_class1"
class2 = pd.read_json('sub_class2_P2.json')
class2.Name = "sub_class2"
class3 = pd.read_json('sub_class3_P2.json')
class3.Name = "sub_class3"

classes = [class0, class1, class2, class3]

# sentiment distribution bar plot
# count positive and negative for each class
positive = {
    "sub_class0": 0,
    "sub_class1": 0,
    "sub_class2": 0,
    "sub_class3": 0
}
negative = {
    "sub_class0": 0,
    "sub_class1": 0,
    "sub_class2": 0,
    "sub_class3": 0
}

for sub_class in classes:
    for index, row in sub_class.iterrows():
        if row['Sentiment'] > 0:
            positive[sub_class.Name] += 1
        elif row['Sentiment'] < 0:
            negative[sub_class.Name] += 1
    
# plot
fig, ax = plt.subplots()
df = pd.DataFrame([positive, negative], index=['Positive', 'Negative'])
df = df.transpose()
df.plot(kind='bar', ax=ax)
plt.show()
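
The eight quantities can also be computed without row-by-row loops. The following is a minimal sketch on hypothetical stand-in data-frames (the `frames` dictionary and its values are invented for illustration); it builds the labels and counts in the required order, which could then be passed to a bar-plot call of your choice.

```python
import pandas as pd

# Hypothetical stand-ins for the four Part 2 data-frames.
frames = {
    "class0": pd.DataFrame({"Sentiment": [0.5, -0.3, 0.2]}),
    "class1": pd.DataFrame({"Sentiment": [0.8]}),
    "class2": pd.DataFrame({"Sentiment": [-0.6, -0.2]}),
    "class3": pd.DataFrame({"Sentiment": [0.15, -0.4]}),
}

labels, counts = [], []
for name, frame in frames.items():
    # Positive first, then negative, matching the i..viii bar order.
    labels += [f"{name} positive", f"{name} negative"]
    counts += [int((frame["Sentiment"] > 0).sum()),
               int((frame["Sentiment"] < 0).sum())]

for label, count in zip(labels, counts):
    print(label, count)
```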




**Part 4.** Pool together all modified submissions from Part 2 into a single data-frame, but include an additional
column named Class that corresponds to the original submission’s subreddit class (i.e. 0, 1, 2, or 3). Ex: If there
are 37 submissions from the class2 subreddit in the combined data-frame, the Class column for each of the 37
submissions should have a value of 2. Save this consolidated data-frame to a json file with the name
sub_combined.json.



In [70]:
# Load json files
class0 = pd.read_json('sub_class0_P2.json')
class0.Name = 0
class1 = pd.read_json('sub_class1_P2.json')
class1.Name = 1
class2 = pd.read_json('sub_class2_P2.json')
class2.Name = 2
class3 = pd.read_json('sub_class3_P2.json')
class3.Name = 3

classes = [class0, class1, class2, class3]

# combine and add class column (integer labels 0-3, as the task specifies)
for sub_class in classes:
    sub_class.insert(0, 'Class', sub_class.Name)
combined = pd.concat(classes, ignore_index=True)

# re-export to json
combined.to_json('sub_combined.json', orient='records')
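
To confirm that the Class labels survive the trip through JSON as integers, a small round-trip check can help. This sketch uses hypothetical data-frames and an in-memory string instead of a file; `assign` is used here as one possible way to add the column.

```python
import io
import pandas as pd

# Hypothetical stand-ins for three per-class data-frames.
frames = [
    pd.DataFrame({"Title": ["a", "b"], "Sentiment": [0.5, -0.4]}),
    pd.DataFrame({"Title": ["c"], "Sentiment": [0.3]}),
    pd.DataFrame({"Title": ["d", "e"], "Sentiment": [-0.2, 0.6]}),
]

# Label each frame with its integer class, then stack with a fresh 0..n-1 index.
labeled = [frame.assign(Class=label) for label, frame in enumerate(frames)]
combined = pd.concat(labeled, ignore_index=True)

# Round trip through JSON text to check what a later read_json call would see.
text = combined.to_json(orient="records")
restored = pd.read_json(io.StringIO(text))
print(restored)
```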


**Part 5.** 
* i. Using your combined collection data-frame from Part 4, assuming it has n submissions in total, your next
goal is to construct an n x 5 numpy feature array suited for machine learning. Each row in the array corresponds to
a row in your data-frame, and the 5 array columns represent the features for the submission at that position as
follows:
* Feature 1: The character length of the submission’s title (don’t preprocess the title in any way!)
* Feature 2: The sentiment of the submission’s title.
* Feature 3: The submission’s Score.
* Feature 4: The submission’s number of comments (Num_Comments).
* Feature 5: The year of the submission (note there are several ways to extract the year from a Date in Python).
For example, one row in your feature array may look like the below:
[46. , -0.2 , 1356.0 , 249.0, 2020. ]
meaning that the corresponding submission had a title with 46 characters, sentiment of -0.2, score of 1356, 249
comments, and was posted in the year 2020.
* ii. Now create an n x 1 numpy array using the Class column you added in Part 4. Each row here represents a target
variable for classification that matches the corresponding submission. For example, if the first 46 rows in your
data-frame from Part 4 correspond to class 0 submissions, then the first 46 entries in this array should likewise be 0.
* iii. You are to then perform 10-fold cross-validation using one classification estimator (either one used during the
course module on machine learning, or one of your own choosing) to determine the accuracy available in using
the features from the n x 5 array in predicting the Class given in the n x 1 array. For full credit, you should produce
both classifier accuracy for each cross-validation and a confusion matrix for the classifier predictions. You may
want to look into using cross_val_predict for the latter. Write a few sentences (comments or a separate text file
are fine for this) discussing whether or not you think the features and chosen classifiers provide acceptable
accuracy for the task.

In [None]:
import numpy as np

combined = pd.read_json('sub_combined.json')
# print(combined)

# make np array
npCombine = np.zeros((len(combined), 5))
for index, row in combined.iterrows():
    npCombine[index][0] = len(row['Title'])
    # print(len(row['Title']))
    npCombine[index][1] = row['Sentiment']
    # print(row['Sentiment'])
    npCombine[index][2] = row['Score']
    # print(row['Score'])
    npCombine[index][3] = row['Num_Comments']
    # print(row['Num_Comments'])
    npCombine[index][4] = pd.to_datetime(row['Date'], unit='s').year
    # print(pd.to_datetime(row['Date'], unit='s').year)

# np.set_printoptions(suppress=True)
# print(npCombine)

# np array of class
npClass = np.array(combined['Class'])

# print(npClass)

# 10 fold cross validation to determine accuracy that first 5 features can predict class
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

knn = KNeighborsClassifier()
kfold = KFold(n_splits=10, random_state=11, shuffle=True)
scores = cross_val_score(knn, npCombine, npClass, cv=kfold)
print('Accuracy per fold:', np.round(scores, 4))
print(f'K Neighbors Classifier: mean accuracy={scores.mean():.2%}, '
      f'standard deviation={scores.std():.2%}')

predicted = cross_val_predict(knn, npCombine, npClass, cv=kfold)
conf_matrix = confusion_matrix(npClass, predicted)
print("\nConfusion Matrix: ")
print(conf_matrix)

Not sure if I did this right, but the results make it seem like these features are good at predicting the subreddit class. I would have thought the text of the title would be the only data useful for predicting where a submission is from, but it seems like the model is able to figure out something I can't see.
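
For reference, the n x 5 feature array and the target array from Part 5 can also be built without a row loop. The sketch below runs on a two-row hypothetical data-frame and assumes Date holds Unix timestamps in seconds, which may not match the real files.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; Date is assumed to be a Unix timestamp in seconds.
combined = pd.DataFrame({
    "Class": [0, 2],
    "Title": ["Great game last night", "Is my budget awful?"],
    "Sentiment": [0.4, -1.0],
    "Score": [1356, 87],
    "Num_Comments": [249, 15],
    "Date": [1577836800, 1609459200],
})

# One column per feature, in the order listed in Part 5 i.
features = np.column_stack([
    combined["Title"].str.len(),                         # raw character length
    combined["Sentiment"],
    combined["Score"],
    combined["Num_Comments"],
    pd.to_datetime(combined["Date"], unit="s").dt.year,  # year of submission
]).astype(float)

# n x 1 target array, as described in Part 5 ii.
targets = combined["Class"].to_numpy().reshape(-1, 1)
print(features)
print(targets)
```

scikit-learn estimators generally expect a 1-D target, so `targets.ravel()` would be the form to pass to `cross_val_score` or `cross_val_predict`.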