## Washington Post app reviews analysis

Since you all are investigative journalists, time to [reproduce this story](https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/)!

I've given you `reviews-marked.csv`, a CSV file of app reviews. Some are marked as being about unwanted sexual behavior – the `sexual` column – while some are unlabeled.

> You can use [this notebook](https://github.com/jsoma/dcj-indonesia-2022-machine-learning/blob/main/classification-app-reviews/Therapy%20app%20reviews.ipynb) as a source of cut-and-paste. It will get you 99% of the way there with practically no edits!

## Section A: simple TF-IDF Vectorizer

1. Build a classifier that filters reviews for the journalists to check out.
2. Get a list of the most useful words for the classification.
3. Use a test/train split and confusion matrix to determine how well your process works.

Since the labeled and unlabeled data is all in one file, you'll need to filter for ones with labels to build your training dataset.

You can use whatever kind of classifier you want. For your vectorizer, use a basic TF-IDF vectorizer with a 300-feature limit:

```python
vectorizer = TfidfVectorizer(max_features=300)
```

**How many of the unlabeled reviews does your classifier believe are about unwanted sexual behavior?** Your classifier doesn't have to do a *good* job, it just has to work.

In [1]:
import pandas as pd
pd.options.display.max_colwidth = 400

df = pd.read_csv("reviews-marked.csv")
df.head(3)

Unnamed: 0,Country,Date,Rating,Review,Version,source,racism,bullying,sexual
0,US,11/22/2019,5,It’s a great app to meet new people and chat in very satisfied with downloading this app i recommend this app if you like to chat or just to meet new people. And you can choose which country To find different users!,4.4.5,holla,,,
1,US,11/22/2019,5,"Holla is an excellent app, where I get to know new people every time and even get to make new friends. I truly recommend this application to all people!",4.4.5,holla,,,
2,US,11/22/2019,1,Get rid of micro transactions or i will find a new app to use. Why should i have to pay for that it’s so stupid,3.8,holla,0.0,0.0,0.0


In [2]:
labeled_data = df[df.sexual.notna()]
labeled_data.head(3)

Unnamed: 0,Country,Date,Rating,Review,Version,source,racism,bullying,sexual
2,US,11/22/2019,1,Get rid of micro transactions or i will find a new app to use. Why should i have to pay for that it’s so stupid,3.8,holla,0.0,0.0,0.0
6,US,11/21/2019,1,This is good but most of my messages never show up. This is very crapy and needs to be fixed,3.3.1,skout,0.0,0.0,0.0
8,US,11/20/2019,1,I was really enjoying this app. This brought me out of the box. I’m an extremely shy person and this gave me somewhere to talk to nice people. I just got kicked of bc I’m 16 not “18” and I think that this change it kind of stupid bc yeah it’s for protection but like someone else said all you have to do is put age preferences like you do for gender not that hard I wish this wasn’t the case or t...,4.2.1,holla,0.0,0.0,0.0


In [3]:
unlabeled_data = df[df.sexual.isna()]
unlabeled_data.head(3)

Unnamed: 0,Country,Date,Rating,Review,Version,source,racism,bullying,sexual
0,US,11/22/2019,5,It’s a great app to meet new people and chat in very satisfied with downloading this app i recommend this app if you like to chat or just to meet new people. And you can choose which country To find different users!,4.4.5,holla,,,
1,US,11/22/2019,5,"Holla is an excellent app, where I get to know new people every time and even get to make new friends. I truly recommend this application to all people!",4.4.5,holla,,,
3,US,11/22/2019,5,"Free to use app, meet people around the world.",-,holla,,,


In [4]:
print("Labeled is", len(labeled_data))
print("Unlabeled is", len(unlabeled_data))

Labeled is 330
Unlabeled is 55726


In [5]:
!pip install pystemmer
!pip install scikit-learn



In [6]:
!pip install --upgrade setuptools



In [8]:
!pip install Cython



In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

stemmer = Stemmer.Stemmer('en')

class StemmedTfidfVectorizer(TfidfVectorizer):
    # Stem every token produced by the standard analyzer
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

vectorizer = StemmedTfidfVectorizer(max_features=300)
matrix = vectorizer.fit_transform(labeled_data.Review)

In [10]:
vectorizer.get_feature_names_out()

array(['10', '50', 'abl', 'about', 'account', 'actual', 'ad', 'add',
       'after', 'again', 'age', 'all', 'almost', 'also', 'alway', 'an',
       'and', 'ani', 'annoy', 'anoth', 'anyon', 'anyth', 'app', 'are',
       'as', 'ask', 'at', 'away', 'back', 'bad', 'ban', 'be', 'becaus',
       'been', 'befor', 'better', 'block', 'bot', 'bug', 'but', 'buy',
       'by', 'call', 'camera', 'can', 'care', 'caus', 'chang', 'chat',
       'close', 'come', 'comment', 'connect', 'constant', 'contact',
       'convers', 'cool', 'could', 'crash', 'date', 'day', 'delet', 'did',
       'didn', 'differ', 'do', 'doe', 'doesn', 'don', 'down', 'download',
       'easi', 'els', 'email', 'even', 'ever', 'everi', 'everyon',
       'everyth', 'fake', 'featur', 'femal', 'few', 'find', 'fix', 'for',
       'freez', 'friend', 'from', 'full', 'fun', 'gay', 'get', 'give',
       'glitch', 'go', 'good', 'got', 'great', 'guy', 'had', 'happen',
       'hard', 'has', 'hate', 'have', 'haven', 'help', 'here', 'horribl',

In [11]:
pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,10,50,abl,about,account,actual,ad,add,after,again,...,won,work,wors,would,wrong,year,yet,you,your,zero
0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0
1,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.0,0.184195,0.000000,0.0,0.0,0.178109,0.000000,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.211241,0.0,0.0,0.000000,0.000000,0.0,0.0,0.120541,0.000000,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.0,0.141838,0.000000,0.0,0.0,0.000000,0.000000,0.0
326,0.0,0.0,0.0,0.0,0.232741,0.0,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0
327,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.155329,0.000000,0.0
328,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0


In [12]:
from sklearn.svm import LinearSVC

X = matrix
y = labeled_data.sexual

clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)

In [13]:
!pip install eli5



In [14]:
import eli5

eli5.explain_weights(clf, feature_names=vectorizer.get_feature_names_out())

Weight?,Feature
+2.026,nude
+1.831,guy
+1.517,men
+1.332,thing
+1.319,without
+1.251,pay
+1.244,off
+1.109,video
+1.075,on
+1.073,find


In [20]:
# Work on an explicit copy so adding columns does not raise SettingWithCopyWarning
labeled_data = labeled_data.copy()

# Reuse the already-fitted vocabulary; the classifier was trained on these columns
X = vectorizer.transform(labeled_data.Review)

labeled_data['predicted'] = clf.predict(X)
# Despite the column name, decision_function returns a signed distance
# from the decision boundary, not a probability
labeled_data['predicted_proba'] = clf.decision_function(X)


In [21]:
labeled_data.sort_values(by='predicted_proba', ascending=False)

Unnamed: 0,Country,Date,Rating,Review,Version,source,racism,bullying,sexual,predicted,predicted_proba
113,US,09/06/2019,1,The only thing people want on this app is either nudes or your kik,2.6,chat-for-strangers,0.0,0.0,1.0,1.0,0.950378
811,US,12/29/2018,1,You say nudity isn’t allowed yet you allow men and women to show their private parts all the time. Need to pay more attention to your client base,5.01.1,skout,0.0,0.0,1.0,1.0,0.943979
138,US,08/27/2019,1,I went on there to find new friends but I can’t go 3 matches without finding a creepy 30 year old or finding some guy with his d*** out!!!,3.1.3,holla,0.0,0.0,1.0,1.0,0.934088
842,US,12/23/2018,1,"I went on this and all there was where pervs asking for nudes! Im only 14 and this made me want to stay away from the scary place of the internet! Even if i started to get close to someone they would later ask something like ""hey wanna trade nudes now that we're friends?"" It would be a great app if it had restrictions and plz dont put in a camera because i know some of these pervs will post nu...",3.4,chat-for-strangers,0.0,0.0,1.0,1.0,0.928644
604,US,02/23/2019,1,"all this app supports is hate and negativity. You can’t get on this app without being constantly made fun of. I’ve gotten extremely suicidal because of this dumb app. Your asked for nudes 24/7, and get bullied 24/7. I wouldn’t start convos with anyone, and they’d randomly message me calling me fat or ugly or even sometimes both. The app even supports it as they let you do these livestreams cal...",3.3.1,holla,0.0,1.0,1.0,1.0,0.915576
...,...,...,...,...,...,...,...,...,...,...,...
26,US,11/05/2019,2,please fix it as soon as you can coz its annoying,3.1,skout,0.0,0.0,0.0,0.0,-1.685809
849,US,12/22/2018,2,Keeps freezing up,4.21.2,skout,0.0,0.0,0.0,0.0,-1.703934
490,US,03/29/2019,1,I didn't really like it sorry,4.0.2,skout,0.0,0.0,0.0,0.0,-1.730768
441,US,04/16/2019,1,Glitchy af fix it please,4.2.1,holla,0.0,0.0,0.0,0.0,-1.766389


In [28]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X = vectorizer.fit_transform(labeled_data.Review)
y = labeled_data.sexual

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Train
clf = LinearSVC(class_weight='balanced')
clf.fit(X_train, y_train)

# Test
y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

# How did it do?
label_names = pd.Series(['not sexual', 'sexual'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not sexual,Predicted sexual
Is not sexual,79,0
Is sexual,3,1
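
The question above asks how many unlabeled reviews the classifier flags. A minimal sketch of that last step follows, using a tiny invented DataFrame in place of `reviews-marked.csv`. The column names `Review` and `sexual` match the file; the rows and the `toy_*` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for reviews-marked.csv (invented rows, same column names)
toy = pd.DataFrame({
    "Review": [
        "people keep asking for nudes",
        "men sending nudes to everyone",
        "app crashes all the time",
        "too many ads and it crashes",
        "guys asking for nudes again",
        "crashes every time i open it",
    ],
    "sexual": [1.0, 1.0, 0.0, 0.0, np.nan, np.nan],
})

toy_labeled = toy[toy.sexual.notna()]
toy_unlabeled = toy[toy.sexual.isna()].copy()

# Fit the vocabulary on the labeled reviews only
vec = TfidfVectorizer(max_features=300)
toy_clf = LinearSVC(class_weight="balanced", random_state=0)
toy_clf.fit(vec.fit_transform(toy_labeled.Review), toy_labeled.sexual)

# transform (not fit_transform) keeps the columns the classifier was trained on
toy_unlabeled["predicted"] = toy_clf.predict(vec.transform(toy_unlabeled.Review))

print(toy_unlabeled.predicted.value_counts())
flagged = int((toy_unlabeled.predicted == 1).sum())
print("Flagged as sexual:", flagged)
```

The key design choice is calling `transform` on the unlabeled reviews. Refitting the vectorizer there would produce a different vocabulary, and the classifier's weights would no longer line up with the columns.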


## Section B: A custom vectorizer

Repeat the above, but **customize your `TfidfVectorizer`** in an attempt to improve your model. You can also change your classifier, if you'd like.

For the vectorizer, you can add:

|term|use|
|---|---|
|`vocabulary=`|a custom list of words to look at|
|`stop_words=`|a list of words to ignore|
|`max_features=`|the total number of features (words) to count|
|`max_df=`|the maximum number of documents a word can appear in|
|`min_df=`|the minimum number of documents a word must appear in|

`max_df` and `min_df` each accept either a whole number (a count of documents) or a decimal (a proportion of documents). For example, `3` means 3 documents and `0.3` means 30% of documents.

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

stemmer = Stemmer.Stemmer('en')

class StemmedTfidfVectorizer(TfidfVectorizer):
    # Stem every token produced by the standard analyzer
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

vectorizer = StemmedTfidfVectorizer(max_features=300, max_df=0.30)
matrix = vectorizer.fit_transform(labeled_data.Review)

In [30]:
vectorizer.get_feature_names_out()

array(['10', '50', 'abl', 'about', 'account', 'actual', 'ad', 'add',
       'after', 'again', 'age', 'all', 'allow', 'almost', 'also', 'alway',
       'am', 'an', 'ani', 'annoy', 'anoth', 'anymor', 'anyon', 'anyth',
       'are', 'as', 'ask', 'at', 'away', 'back', 'bad', 'ban', 'be',
       'becaus', 'been', 'better', 'block', 'bot', 'bug', 'but', 'buy',
       'by', 'camera', 'can', 'care', 'caus', 'chang', 'chat', 'close',
       'come', 'comment', 'connect', 'constant', 'contact', 'convers',
       'cool', 'could', 'crash', 'date', 'day', 'delet', 'did', 'didn',
       'differ', 'disappear', 'do', 'doe', 'doesn', 'don', 'down',
       'download', 'easi', 'email', 'even', 'ever', 'everi', 'everyon',
       'everyth', 'fake', 'featur', 'femal', 'few', 'filter', 'find',
       'fix', 'for', 'free', 'freez', 'friend', 'from', 'full', 'fun',
       'gay', 'gem', 'gender', 'get', 'girl', 'give', 'glitch', 'go',
       'good', 'got', 'great', 'guy', 'had', 'happen', 'hard', 'has',
       ...

In [31]:
pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,10,50,abl,about,account,actual,ad,add,after,again,...,women,won,work,wors,would,wrong,yet,you,your,zero
0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0
1,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.219475,0.000000,0.0,0.212223,0.000000,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.220191,0.0,0.0,0.000000,0.000000,0.0,0.125648,0.000000,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.146870,0.000000,0.0,0.000000,0.000000,0.0
326,0.0,0.0,0.0,0.0,0.230462,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0
327,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.171612,0.000000,0.0
328,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0


In [32]:
from sklearn.svm import LinearSVC

X = matrix
y = labeled_data.sexual

clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)

In [33]:
import eli5

eli5.explain_weights(clf, feature_names=vectorizer.get_feature_names_out())

Weight?,Feature
+1.997,nude
+1.869,guy
+1.524,men
+1.332,thing
+1.253,filter
+1.189,off
+1.131,find
+1.122,without
+1.067,on
+0.999,pay


In [34]:
# Work on an explicit copy so adding columns does not raise SettingWithCopyWarning
labeled_data = labeled_data.copy()

# Reuse the already-fitted vocabulary; the classifier was trained on these columns
X = vectorizer.transform(labeled_data.Review)

labeled_data['predicted'] = clf.predict(X)
# Despite the column name, decision_function returns a signed distance
# from the decision boundary, not a probability
labeled_data['predicted_proba'] = clf.decision_function(X)


In [35]:
labeled_data.sort_values(by='predicted_proba', ascending=False)

Unnamed: 0,Country,Date,Rating,Review,Version,source,racism,bullying,sexual,predicted,predicted_proba
113,US,09/06/2019,1,The only thing people want on this app is either nudes or your kik,2.6,chat-for-strangers,0.0,0.0,1.0,1.0,0.966283
811,US,12/29/2018,1,You say nudity isn’t allowed yet you allow men and women to show their private parts all the time. Need to pay more attention to your client base,5.01.1,skout,0.0,0.0,1.0,1.0,0.932436
842,US,12/23/2018,1,"I went on this and all there was where pervs asking for nudes! Im only 14 and this made me want to stay away from the scary place of the internet! Even if i started to get close to someone they would later ask something like ""hey wanna trade nudes now that we're friends?"" It would be a great app if it had restrictions and plz dont put in a camera because i know some of these pervs will post nu...",3.4,chat-for-strangers,0.0,0.0,1.0,1.0,0.930526
138,US,08/27/2019,1,I went on there to find new friends but I can’t go 3 matches without finding a creepy 30 year old or finding some guy with his d*** out!!!,3.1.3,holla,0.0,0.0,1.0,1.0,0.926850
247,US,06/24/2019,1,everywhere perverts.\nbad quality video\ncannot filter without paying\nlame,2.1.9,holla,0.0,0.0,1.0,1.0,0.909296
...,...,...,...,...,...,...,...,...,...,...,...
437,US,04/18/2019,1,Help it won't let me connect to people,2.1,chat-for-strangers,0.0,0.0,0.0,0.0,-1.563538
887,US,12/14/2018,1,It was great.. Tons of people.. But then it was glitching and now it doesn't even work.. I'm sad :(,4.0.4,skout,0.0,0.0,0.0,0.0,-1.595506
526,US,03/19/2019,1,...could use more security for those who are really honest people and are sincere about what they are looking for.,3.7.1,skout,0.0,0.0,0.0,0.0,-1.665930
658,US,02/03/2019,1,The boost feature really giving attention to people who are turning to spread hate,3.23.1,holla,0.0,0.0,0.0,0.0,-1.673402


In [38]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X = vectorizer.fit_transform(labeled_data.Review)
y = labeled_data.sexual

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Train
clf = LinearSVC(class_weight='balanced')
clf.fit(X_train, y_train)

# Test
y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

# How did it do?
label_names = pd.Series(['not sexual', 'sexual'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not sexual,Predicted sexual
Is not sexual,77,0
Is sexual,5,1
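
The two test sets above contained only 4 and 6 positive reviews, so a single random split gives a noisy picture. One possible way to get a steadier estimate is stratified cross-validation with the vectorizer inside a pipeline. This is a sketch on invented reviews, not the assignment's required method; `model`, `folds` and the toy texts are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled reviews, invented for illustration; 1 = unwanted sexual behavior
texts = [
    "people keep asking for nudes", "men sending nudes to everyone",
    "guys asking for nudes again", "creepy men asking for nudes",
    "app crashes all the time", "too many ads and it crashes",
    "crashes every time i open it", "ads everywhere and it freezes",
    "cannot log in after the update", "the update crashes on open",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The pipeline refits the vectorizer inside each fold, so held-out reviews
# never influence the vocabulary or the IDF weights
model = make_pipeline(
    TfidfVectorizer(max_features=300),
    LinearSVC(class_weight="balanced", random_state=0),
)

# Stratified folds keep the share of positive reviews similar in every fold
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
predicted = cross_val_predict(model, texts, labels, cv=folds)

cm = confusion_matrix(labels, predicted)
label_names = pd.Series(["not sexual", "sexual"])
print(pd.DataFrame(cm,
                   columns="Predicted " + label_names,
                   index="Is " + label_names))
```

Every labeled review is predicted exactly once while held out, so the confusion matrix covers all labeled rows instead of one quarter of them.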
