In [1]:
import pandas as pd
import math

In [2]:
use_data = pd.read_csv('Eluvio_DS_Challenge.csv')

In [3]:
use_data.head()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews


At first glance, this data appears to come from Reddit, judging by the up-vote and down-vote columns. Every post shown falls under the worldnews category. The variables are the creation time and date, the number of up and down votes, the title, whether the post is marked as 18+ content, and the author.

# Clean and Tidy the Data

In [4]:
use_data[use_data.category != 'worldnews']

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category


The filter above returns no rows, so every post is in worldnews. The category column carries no information and can be dropped.

In [5]:
use_data = use_data.drop(columns = ['category'])
use_data.head()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans


Separate the data into two tables: posts marked 18+ and posts considered safe for minors.

In [6]:
over_18 = use_data[use_data.over_18 == True].drop(columns = ['over_18'])
under_18 = use_data[use_data.over_18 == False].drop(columns = ['over_18'])

In [7]:
over_18

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,author
1885,1206381438,2008-03-24,189,0,Pics from the Tibetan protests - more graphic ...,pressed
6721,1211138718,2008-05-18,5,0,"MI5 linked to Max Mosley’s Nazi-style, sadomas...",alllie
8414,1212694925,2008-06-05,0,0,Tabloid Horrifies Germany: Poland s Yellow Pre...,stesch
12163,1216672016,2008-07-21,0,0,Love Parade Dortmund: Techno Festival Breaks R...,stesch
12699,1217381380,2008-07-30,5,0,IDF kills young Palestinian boy. Potentially N...,cup
...,...,...,...,...,...,...
503776,1477889966,2016-10-31,4,0,Latest Italian Earthquake Devastates Medieval ...,pixelinthe
508067,1479400229,2016-11-17,12,0,ISIS Release Video Showing Melbourne As A Poss...,halacska
508176,1479434681,2016-11-18,0,0,Animal welfare activists have released footage...,NinjaDiscoJesus
508376,1479492875,2016-11-18,6,0,Jungle Justice : Public lynching of a street ...,avivi_


In [8]:
under_18

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,author
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,polar
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,polar
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,polar
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,fadi420
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,mhermans
...,...,...,...,...,...,...
509231,1479816764,2016-11-22,5,0,Heil Trump : Donald Trump s alt-right white...,nonamenoglory
509232,1479816772,2016-11-22,1,0,There are people speculating that this could b...,SummerRay
509233,1479817056,2016-11-22,1,0,Professor receives Arab Researchers Award,AUSharjah
509234,1479817157,2016-11-22,1,0,Nigel Farage attacks response to Trump ambassa...,smilyflower


Looking at just the titles in each subgroup of Reddit posts, I want to see whether the title is a good indicator of whether a post is classified as 18+. To test this, I found a CSV file containing words that may be considered 18+.

In [9]:
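import csv
# Use a context manager so the file is closed after reading
with open('swearWords.csv') as file_swears:
    data_swears = csv.reader(file_swears)
    restricted_words = list(data_swears)[0]  # the word list is the first row of the file
# restricted_words (not shown because the words may be triggering)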

In [10]:
swears = '|'.join(restricted_words)

In [11]:
over_18['title_lower'] = over_18['title'].str.lower()
under_18['title_lower'] = under_18['title'].str.lower()

In [12]:
over_18_swears = over_18.title_lower.str.contains(swears)
under_18_swears = under_18.title_lower.str.contains(swears)
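
One caveat with the matching above: joining the words with `|` and calling `str.contains` matches substrings, so a short word can also match inside a longer, unrelated word. The sketch below shows one possible way to restrict matches to whole words. The two-word list and the sample titles are hypothetical stand-ins (the real list comes from swearWords.csv), and the counts reported in this notebook were produced with the substring version, not this one.

```python
import re

# Hypothetical stand-in list; the notebook's real list is read from swearWords.csv.
example_words = ["ass", "hell"]

# Substring pattern, built the same way as `swears` in the notebook.
substring_pattern = "|".join(example_words)

# Whole-word pattern: re.escape protects regex metacharacters,
# and \b anchors each match at word boundaries.
whole_word_pattern = r"\b(?:" + "|".join(re.escape(w) for w in example_words) + r")\b"

# Hypothetical lower-cased titles for illustration only.
titles = [
    "assassination attempt foiled",
    "what the hell happened",
    "shell company probe",
]

substring_hits = [bool(re.search(substring_pattern, t)) for t in titles]
whole_word_hits = [bool(re.search(whole_word_pattern, t)) for t in titles]

print(substring_hits)
print(whole_word_hits)
```

The same pattern string can be passed to `Series.str.contains`, which treats its argument as a regular expression by default. Whether whole-word matching is the right choice depends on the word list, so this is an option to compare against rather than a replacement.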

Now I will perform a two-proportion hypothesis test (z-test).

In [13]:
x1 = over_18_swears.sum()
n1 = len(over_18_swears)
p1 = over_18_swears.mean()
print(x1)
print(n1)
print(p1)

134
320
0.41875


In [14]:
x2 = under_18_swears.sum()
n2 = len(under_18_swears)
p2 = under_18_swears.mean()
print(x2)
print(n2)
print(p2)

95027
508916
0.18672433171682556


H0: p1 = p2 (There **is no** difference between 18+ posts and non-18+ posts in the proportion of titles that contain censored words)

vs.

H1: p1 != p2 (There **is** a difference between 18+ posts and non-18+ posts in the proportion of titles that contain censored words)
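
For reference, the pooled two-proportion z statistic computed in the next two cells can be written as follows, where $x_1, x_2$ are the counts of titles containing a censored word and $n_1, n_2$ are the group sizes. This is the standard textbook form, included here only as a reading aid:

$$
\hat{p} = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2} = \frac{x_1 + x_2}{n_1 + n_2},
\qquad
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
$$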

In [15]:
pooled_p = (p1*n1 + p2*n2)/(n1+n2)
pooled_p

0.18687013486870527

In [16]:
z = (p1-p2)/math.sqrt(pooled_p*(1-pooled_p)*((1/n1)+(1/n2)))
z

10.644484137073835
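
The z statistic can also be turned into a two-sided p-value, which gives a more direct read on significance than comparing against a cutoff. The sketch below wraps the same pooled calculation in a helper and uses `math.erfc` for the normal tail. The function name `two_proportion_z_test` is my own label, not something from a library, and the counts are copied from the outputs of cells 13 and 14.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts taken from the notebook outputs above.
z_check, p_value = two_proportion_z_test(134, 320, 95027, 508916)
print(z_check)
print(p_value)
```

With these counts the p-value is far below 0.05, which agrees with the z value shown above.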

# Conclusion
With a z-value of about 10.64, well above 1.96 (the critical value for a two-sided test at alpha = 0.05), we reject the null hypothesis. There is a difference between 18+ posts and non-18+ posts in the proportion of titles that contain censored words. Approximately 42% of the posts labeled as 18+ contained a censored word in the title, compared with approximately 18.67% of the posts not labeled as 18+.

Although the difference between the proportions is significant, it is notable that more than 95,000 posts classified as safe for minors contained swear words in the title. Because of this, I don't think the title alone can predict whether a post is rated as restricted. With more time and access to more variables, I'd like to predict this using a machine learning model.