In [37]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import nltk
import re

sns.set()

## 1 - Open training data

In [65]:
train_data = "../data/reddit_train.csv"
train = pd.read_csv(train_data)
train.head()

,id,comments,subreddits
0,0,"Honestly, Buffalo is the correct answer. I rem...",hockey
1,1,Ah yes way could have been :( remember when he...,nba
2,2,https://youtu.be/6xxbBR8iSZ0?t=40m49s\n\nIf yo...,leagueoflegends
3,3,He wouldn't have been a bad signing if we woul...,soccer
4,4,Easy. You use the piss and dry technique. Let ...,funny


### Minimal cleaning

In [66]:
def replace_link(word):
    if 'http' in word:
        search = re.search(r'(http[s]?:\/\/)(.*?)(\/.*)', word)
        if search:
            return search.group(2)
        else:
            return word
    else:
        return word
        
# count fully upper-case tokens before lower-casing destroys that information
train['upper'] = train['comments'].apply(lambda x: len([w for w in x.split() if w.isupper()]))
train['comments'] = train['comments'].apply(lambda x: " ".join(w.lower() for w in x.split()))
train['comments'] = train['comments'].apply(lambda x: " ".join(replace_link(w) for w in x.split()))
train['comments'] = train['comments'].str.replace(r'[^\w\s]', '', regex=True)  # remove punctuation
train.head()

,id,comments,subreddits,upper
0,0,honestly buffalo is the correct answer i remem...,hockey,2
1,1,ah yes way could have been remember when he w...,nba,1
2,2,youtube if you didnt find it already nothing o...,leagueoflegends,0
3,3,he wouldnt have been a bad signing if we would...,soccer,1
4,4,easy you use the piss and dry technique let a ...,funny,0
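
The cleaning steps above can be sketched as two small string functions, which makes the order of operations explicit: features that depend on case or punctuation are counted on the raw text, and only then is the text cleaned. This is a minimal sketch, not the notebook's exact logic. The names `raw_features`, `clean_comment`, `URL_RE` and `PUNCT_RE` are hypothetical, and the URL pattern replaces only the URL itself with its host name, whereas `replace_link` replaces the whole whitespace-separated token.

```python
import re

# Assumed patterns for this sketch (not taken from the notebook).
URL_RE = re.compile(r'https?://([^/\s]+)\S*')   # group 1 is the host name
PUNCT_RE = re.compile(r'[^\w\s]')               # anything that is not a word char or space

def raw_features(comment):
    """Count features that depend on case or punctuation, before cleaning."""
    tokens = comment.split()
    return {
        'upper': sum(1 for t in tokens if t.isupper()),
        'n_questions': comment.count('?'),
    }

def clean_comment(comment):
    """Lowercase, replace each URL by its host name, then drop punctuation."""
    text = comment.lower()
    text = URL_RE.sub(lambda m: m.group(1), text)
    text = PUNCT_RE.sub('', text)
    return ' '.join(text.split())  # collapse repeated whitespace

sample = "WOW, did you see https://youtu.be/abc123 already? Nothing special."
print(raw_features(sample))
print(clean_comment(sample))
```

With a DataFrame, the same functions could be applied column-wise, for example `train['comments'].apply(clean_comment)`.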


In [67]:
stat_cols = ['char_count', 'n_questions', 'youtube_url', 'non_english', 'numerics', 'upper']

train['char_count']  = train['comments'].str.len()  # character count, including spaces
# NOTE: punctuation was already stripped in the cleaning cell, so this count is always 0;
# count '?' before cleaning to get a useful feature
train['n_questions'] = train['comments'].apply(lambda x: x.count('?'))
train['youtube_url'] = train['comments'].apply(lambda x: x.count('youtu'))
train['numerics'] = train['comments'].apply(lambda x: len([w for w in x.split() if w.isdigit()]))

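# English word list (requires a one-time nltk.download('words'))
words = set(nltk.corpus.words.words())

def number_nonenglish_words(comment):
    # count tokens that do not appear in the English word list
    nonenglish = [w for w in comment.split() if w.lower() not in words]
    return len(nonenglish)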

train['non_english'] = train['comments'].apply(number_nonenglish_words)
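
To make the `non_english` feature concrete, here is a minimal sketch of the same counting logic on a toy vocabulary. The function name `count_unknown_words` and the set `toy_vocabulary` are hypothetical stand-ins for `number_nonenglish_words` and the NLTK word list. The point is that any token missing from the list is counted, so slang, player names and typos all contribute, not only words from other languages.

```python
def count_unknown_words(comment, vocabulary):
    """Count whitespace-separated tokens whose lowercase form is not in `vocabulary`."""
    return sum(1 for w in comment.split() if w.lower() not in vocabulary)

# Hypothetical toy vocabulary, standing in for the NLTK word list.
toy_vocabulary = {'the', 'is', 'a', 'game', 'good'}

print(count_unknown_words('the game is gg wp', toy_vocabulary))
```

Because the count grows with comment length, dividing it by the number of tokens may give a feature that is easier to compare across subreddits.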

In [68]:
stats = train[stat_cols + ['subreddits']]
stats.groupby('subreddits').mean()

subreddits,char_count,n_questions,youtube_url,non_english,numerics,upper
AskReddit,236.275429,0.0,0.047714,5.616857,0.335429,1.598
GlobalOffensive,170.702857,0.0,0.123143,5.378,0.389429,1.245714
Music,392.850571,0.0,0.305429,13.181429,1.112857,1.457429
Overwatch,240.945143,0.0,0.154,6.786,0.387429,1.680286
anime,232.803714,0.0,0.197143,6.902571,0.357429,1.673714
baseball,168.938571,0.0,0.060286,5.841714,0.711429,1.225714
canada,276.514857,0.0,0.071429,7.402571,0.383143,1.095429
conspiracy,274.048857,0.0,0.158286,7.388286,0.313143,1.255143
europe,265.829429,0.0,0.092,7.729429,0.361714,1.084286
funny,161.592,0.0,0.053143,4.189143,0.212857,1.098571


## 2 - EDA
 

### Different Categories

In [5]:
set(train['subreddits'])

{'AskReddit',
 'GlobalOffensive',
 'Music',
 'Overwatch',
 'anime',
 'baseball',
 'canada',
 'conspiracy',
 'europe',
 'funny',
 'gameofthrones',
 'hockey',
 'leagueoflegends',
 'movies',
 'nba',
 'nfl',
 'soccer',
 'trees',
 'worldnews',
 'wow'}
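
The per-subreddit cells in this section all repeat the same filter-and-print loop. A minimal sketch of a reusable helper follows; the names `first_comments` and `show_comments` are hypothetical and do not appear in the notebook. It works on any two aligned sequences, so plain lists are used here to keep the example self-contained.

```python
from itertools import islice

def first_comments(comments, labels, subreddit, n=20):
    """Return the first `n` comments whose label equals `subreddit`."""
    matches = (c for c, s in zip(comments, labels) if s == subreddit)
    return list(islice(matches, n))

def show_comments(comments, labels, subreddit, n=20):
    """Print the selected comments, separated by '---' as in the cells of this section."""
    for comment in first_comments(comments, labels, subreddit, n):
        print("---")
        print(comment)

# Toy data for illustration only.
toy_comments = ['great save', 'what a dunk', 'nice goal', 'trade him']
toy_labels = ['hockey', 'nba', 'hockey', 'nba']
show_comments(toy_comments, toy_labels, 'hockey', n=1)
```

With the training DataFrame, the call would look like `show_comments(train['comments'], train['subreddits'], 'hockey')`, and looping over `sorted(set(train['subreddits']))` would cover all 20 categories in one cell.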

### 2.1 AskReddit

In [28]:
subset = train[train['subreddits'] == 'AskReddit']
others = train[train['subreddits'] != 'AskReddit']

for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
was at a trampoline centre for my sisters birthday. we sat down to have a coffee after the session. i comment that the coffee was really hot and it could burn someone. the table next to mine was slightly sloped. a woman puts her baby in a high chair next to the table and the waiter puts the coffee down. since the table was sloped i saw the coffee tip and run off the table. woman leaps up to clean up the table. but the super hot coffee runs into the lap of the baby in the high chair. it starts screaming at it's scalding. the mum freaks out, not knowing what happened. i stand up and grab her baby and shout "hot coffee on its lap" and rip the soaked romper off. i hand the baby back to the mum, jog to the fridge and grab 3 bottles of cold water to pour over the babies lap. while the mum is holding the baby i pour each bottle after another. i tell the stunned staff to call 000 and report a burn with hot water. i tell another staff to bring me a cold wet towel. i could see the babies ski

In [None]:
subset = subset.assign(length=subset['comments'].str.len())  # comment length in characters

### 2.2 GlobalOffensive

In [7]:
subset = train[train['subreddits'] == 'GlobalOffensive']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
afraid i'll get addicted and fail school or smthing lol
---
fuckin n0thing one taps him, stands up and takes his shirt off what a straight g
---
i mean maybe theres a point to playing a stretched resolution, but why is his donation notification also stretched af.
---
but to be fair, both the igls were doing there part. none of them was doing a pronax.
---
hell yeah! finals hype! i'm so proud of imt. they wrecked vp in their home soil. they played so well. who would have thought, a week ago, that the brazilian team in the finals would not be sk. brazilian cs is a hell of a drug.
---
excepted imt to go out in groups or quarters at best, gambit i had some hope for considering the amount of time they prepped for the major. read a few times that reseeding after each round with the swiss system should work fairly great with even matches. with your system you could just be placed in a group with 5 really bad teams who all barely made it past the qualifier and reach playoffs - more teams =

### 2.3 Music

In [8]:
subset = train[train['subreddits'] == 'Music']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
/u/perpetualhorrors, your post has been removed as this artist is in our [updated hall of fame](https://www.reddit.com/r/music/wiki/halloffame). *i am a bot, and this action was performed automatically. please [contact the moderators of this subreddit](/message/compose/?to=/r/music) if you have any questions or concerns.*
---
this is my favorite song. has been since the first time i heard it....15 years ago.
---
oblivius is a top 7-10 strokes song for me for sure, with drag queen and toj not far behind. i love the ep and haven't taken it out of my rotation since release.
---
hesitation marks is probably the most accessible nin album for casual listeners. not completely dissimilar, but nothing like the downward spiral etc.
---
there's that other post with 'not all metal is insane noise and screaming'. lots of good stuff in there.
---
i didn't listen to n-sync but timberlake's solo stuff is great. i feel the same way about beiber's early shit versus the latest album and the collabs h

### 2.4 Overwatch

In [9]:
subset = train[train['subreddits'] == 'Overwatch']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
right! he was a disruptor tank! pull the dps out of the frey and pick them off, then get on the point. now he can't reliably go one on one with any hero and expect to win... it's so heartbreaking..
---
one shot, one kill! wait... that's not the right character...
---
slightly related, let's just take a second to appreciate how baddass pharah looks/is
---
didn't watch the clip but maybe they were trying to increase the time between enemy players' and her respawn?
---
i think that does have something to do with it. i was around 2600 for the first 4 seasons. i d/c'ed right as i joined a game in my placements this season, got placed around 2200, now i can't really be bothered to try to climb back into plat. i play the occasional comp game but i just don't have the time to actually try to climb. if i get there fine, if i don't then i guess i'm gold now. maybe everyone got better while i didn't. maybe i just don't have enough time this season.
---
overwatch is already exempt from loot bo

### 2.5 Anime

In [10]:
subset = train[train['subreddits'] == 'anime']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
akagi is still alpha as fuck and sugawara is suffering. definitely my two favorite couples so far.
---
"oh man i can't wait to vote." *opens link* *starts sweating*
---
omg i was thinking the same.... azumi u the alpha af boi
---
i'm glad you're considering a rewatch. this series is far from perfect but it's certainly very enjoyable. i personally did like subaru and more interesting characters are added to the cast later on as well
---
i live with my shanghainese girlfriend and i agree with the op.
---
what does speed-subbed mean? anyway, i have wanted to watch it for a few weeks now, but i'm currently in the middle of 5 different shows (trigun, lucky star, darker than black, firefly, and the big o)
---
i swear that guy is there to channel the author's sarcasm into oblivious quotes.
---
plot twist: they just loop that ending for 90 minutes, cue everyone burning the theaters down. anno walks away so rich he couldn't have made a better end.
---
minor spoilers to see if it gets your a

### 2.6 Baseball

In [11]:
subset = train[train['subreddits'] == 'baseball']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
braves are a solid b in my book. the fact we're even talking about a possible playoff run is exciting. if you told me at the start of the season we'd be hovering around .500 i would've been thrilled. pitching has been shaky but not horrible, freddie hasn't skipped a beat since his injury, ender is killing it in cf, and johan camargo is exceeding all expectations. trade some of our vets and bring some of our young guys up soon and the braves are gonna be so fun to watch.
---
the 2016 texas rangers need to show them a few things.
---
someone wrote an article detailing the fictitious backstories of everyone in that gif and it might be the best thing i've ever read. edit: [here it is](https://www.progressiveboink.com/2012/5/31/2995814/yankeesfans-gif-animated-yankee-enthusiasts-story)
---
someone strip this fool of his moderatorship! tell me how this is not a high quality video!
---
there were 15 singles and 2 doubles before the grand slam. almost become a "death by a thousand singles"

### 2.7 Canada

In [12]:
subset = train[train['subreddits'] == 'canada']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
they don't have to - but do they have the right to. very different things. say there is a show at a camp ground and there are men in a lake - topless. should woman have the right to be topless in a lake on the same kids program?
---
it's 8 fucking stairs lol. 65 grand is a joke
---
what i'm trying to say is that it's hypocritical for people to be concerned about the well being of a little puppy, while these same people don't even flinch when this puppy's evolutionary cousins are slaughtered by the millions. canada's laws are irrelevant, dogs, cows, pigs and other mammals are all sentient beings, it makes no logical sense to care about the well-being of one mammal while completely neglecting/encouraging the abuse and killing of other mammals. it only makes logical sense if you want to maintain an omnivore's cognitive dissonance on such a topic.
---
your entire post is a stretch. what if the guy was black or hispanic
---
private property and activity is still fall under the rules of 

### 2.8 Conspiracy

In [13]:
subset = train[train['subreddits'] == 'conspiracy']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
why? why would they want isis in charge? to keep the region unstable? why? i don't get it
---
this type of discussion has been around since before the hungarian attempted revolt against the soviet union. this type of discussion seems to completely negate the millions murdered by the leaders of the soviet union; in hungry, the ukraine, czechoslovakia, afghanistan... the people inside the soviet union suffered communism with lack of nutrition, basic freedoms, choices.
---
yes, they(he?) told me that i was arguing semantics. which is bullshit considering it's a huge difference. fusion worked for them to lobby against the magnitsky act.
---
again, why is he on the list? because of his pizzas? what am i missing?
---
&gt;so why did he retweet it, if not an endorsement of that tweet? having a hard time engaging in critical thinking? remember your first post, when you said this: &gt;i think you are a simple person who labels people in absence of critical thought. you literally started off 

### 2.9 Europe

In [14]:
subset = train[train['subreddits'] == 'europe']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
i like people who can find something postitive about something negative! have my upvote.
---
as i pointed out in other post, france most certainly didn't. treaty stipulated more things than declaring war.
---
the present border between austria and switzerland was only set in 1972.
---
i don't like corbyn's chances. may shot herself in the foot several times and still he can't pull away in any polls still trailing in some of them.
---
yeah, that's great, just what i was looking for. we also have a dick idiom, which goes something like this: **рђавом курцу и длака смета**, meaning *a weak dick is bothered even by pubic hair*, a.k.a. a bad workman always blames his tools.
---
the main route through is considered international waters so no i don't think they could block them legally.
---
there was no genocide against fascist soldiers after ww2. this isn't genocide denial.
---
this is a case where conteckst and cultural background is important. sure, i would be offended if i were to see

### 2.10 Funny

In [15]:
subset = train[train['subreddits'] == 'funny']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
easy. you use the piss and dry technique. let a few drops out, let it dry, rinse and repeat. if you get lucky and end up using the piss and dry technique on a hot sunny day, this process will go by a lot faster.
---
the joke is on you! i've only seen it twice... :/
---
there is a more jpeg bot? explain to the uneducated?
---
10 hours ago when i first saw it, moscow airport. the next 7 or 8 were mixed. mostly moscow though.
---
martin luther kicked ass in his day. ...and he was right imho
---
---
my comment was so corny even /r/dadjokes would hate it, down votes deserved. why would anyone down votes your honest question lol
---
ah yes, everyone's favorite series of films detailing what happens when you fail to plan appropriately for what you are undertaking!
---
guilty of what? sucks to be them but what do i have to feel guilty about? was compassion the emotion you were looking for?
---
i see most of the people on our base wearing them like sashes.
---
another non helmet wearing ret

### 2.11 Game Of Thrones

In [17]:
subset = train[train['subreddits'] == 'gameofthrones']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
---
personally i never really liked that biatch. she was always just too perfect. i'm glad euron effed up her fleet.
---
fire can't hurt dragons. it can hurt humans like jon and any other targaryen exept dany cause she is a unique case atm.
---
your submission has been removed because of the following reason(s): **[3. quality:](/r/gameofthrones/wiki/posting_policy#wiki_3.__content_must_be_high_quality_and_provide_unique_value_to_the_community)** reposted content must be old enough to be considered fresh again. that includes links to the same content and repeats of topics that have been posted recently. please check [the new posts list](/r/gameofthrones/new/) before you submit. **if similar content is posted too often, it may be removed outside of the given time frame to make room for more fresh content in an effort to keep with the spirit of the rule encouraging "fresh" content.** [posting policy](/r/gameofthrones/w/posting_policy) | [spoiler guide](/r/gameofthrones/w/spoiler_guide

### 2.12 Hockey

In [18]:
subset = train[train['subreddits'] == 'hockey']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
honestly, buffalo is the correct answer. i remember people (somewhat) joking that buffalo's mantra for starting goalies was "win a game, get traded". i think edmonton's front office was a travesty for the better part of 10 years, but buffalo's systematic destruction of the term 'competitive' was much more responsible for the change to the draft lottery.
---
i honestly couldn't find an overall source where it showed all toi for the whole tournament but i did find individual games on the iihf site [here](http://m.worldjunior2017.com/en/games/2017-01-05/usa-vs-can/#statistics-tab), and he had the 4th most minutes among forwards in both the semifinals and the finals, at least - i figured that was enough to confirm what i remembered.
---
yeah, you didn't see anyone take runs at gretzky like they have at sid. it's a different league.
---
&gt; price hasn't had enough around him which is exactly my point. mcdavid can be jesus on ice but if he doesn't have the team to back him up then his t

### 2.13 Leagueoflegends

In [19]:
subset = train[train['subreddits'] == 'leagueoflegends']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
https://youtu.be/6xxbbr8isz0?t=40m49s if you didn't find it already. nothing out of the ordinary though, she just has eye constant eye contact.
---
i am only gold, but if i play a tank support with the shield item, i wont get gold after laneing phase is over, because the other laners never let me last hit or burst the wave to fast. the other two items don't have that problem..
---
honestly i think steve is just preparing for franchising. it doesn't matter if they win or not, he wants to stir up drama for the next tl series/film. more publicity means more notoriety which that attracts investors and makes money. welcome to the future of the lcs
---
it's called running it down mid aka the tyler1 special
---
hmm. yep, i'm annoyed by it. did i ever imply that my annoyance has any meaning at all to what riot should do? did i ever say that my opinion has any meaning at all? i simply said what i felt and assumed that a lot of people would feel the same, and that it's something that riot co

### 2.14 Movies

In [20]:
subset = train[train['subreddits'] == 'movies']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
his role in mi3 is one of the best villians i've seen in a movie because he genuinely felt like he just didn't give a shit about other people and loved to inflict pain to get what he wanted. he nailed the lack of remorse i just love that movie. it also helps that the story was centered on it becoming very personal for hunt with the women he loved in his life.
---
i think that they had each other's detonator. it wouldn't have proven joker right if they blew up their own boat - there wouldn't be a boatload of survivors that could attest that the joker wasn't the one that pulled the trigger.
---
the flying the eagles to mordor thing is incredibly divisive. i've heard people fight back and forth whether it's a plot hole or not. i'm looking for some insight.
---
&gt; deserve: these characters are interesting enough to deserve their own movies instead of a movie that has to lean on a number of movies. even without references to the avengers, capt america is a successful stand-alone story

### 2.15 NBA

In [21]:
subset = train[train['subreddits'] == 'nba']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
ah yes way could have been :( remember when he was drafted i thought he was gonna be great but nope could have had kawhi thompson or jimmy butler
---
lebron's playbook. 1. wake up, get on the bus, give lebron ball. 2. tired as fuck, give to kyrie. 3. fuck it jr just shoot it
---
i watched the last few minutes of the sc uconn girls game last year to watch uconn get upset. actually was a decent game. that's about all i can remember. wnba is a hard no.
---
i mean that's the theory, sure, but kyrie already has a higher usage rate than kemba. not to mention there would probably be an accompanying drop in efficiency if that were the case.
---
i've seen that video before and thought the same thing lol. dude got moves though.
---
average with a solid chance works for me. there is 0 guarantee that tanking will work in the nba. draft picks are almost never sure fire. . i would rather be a competitive team with a fighters chance than hoping for luck on a pick.
---
i'm convinced. i shall chann

### 2.16 NFL

In [22]:
subset = train[train['subreddits'] == 'nfl']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
shh baby is okay you got your ring, don't ever think about that first half again. here have another sam adams and calm down shh
---
seems pretty simple to me. denver fan is our gm and the only game he watched was the super bowl. it's inexplicably bad.
---
back then i thought he was the man but since his jump to the nfl i've started to rethink it a bit. at the time i was really young so obviously i thought he was really genuine, cool, and i was starstruck.
---
nobody is better than collins imo. may not be a ball hawk but dude is monster at every other facet of his game
---
i went to the same high school as the 85 bears backup center, if that counts for something also my mom's friend is the daughter of ernie accorsi. not a player but still
---
i'm still shocked that someone paid him 15 million and it was the pats of all teams. i've watched every bills game, guys mediocre at best. any other franchise that pays him i laugh my ass off but because it's bb....we'll see.
---
the 2013 chief

### 2.17 Soccer

In [23]:
subset = train[train['subreddits'] == 'soccer']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
he wouldn't have been a bad signing if we wouldn't have paid 18m euros. for the right price he would have been acceptable.
---
and it's been the same stories all window, same players, same clubs, same prices. free back page fodder for the tabloids. big shrug.
---
not willing to negotiate a contract well after the transfer ban could be a valid reason. and lol mate you are trying to defend the incompetent barca board, a bunch of corrupt people, who are not exactly good businessmen as seen from past transfers. who knows what incompetence they showed or said.
---
oh okay.. i guess since you said that.. it must be easy as. surely there's a defensive midfielder we can buy from papua new guinea for a miserly sum of 80 mill or some scandinavian version of jermain jenas from swansea for 50 mill oh wait the entire market is full of over inflated bog standard average players and you expect spurs to splash out simply to show some ambition
---
he's damn right. playing at an acceptable level and

### 2.18 Trees

In [24]:
subset = train[train['subreddits'] == 'trees']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
i'm new to this sub and i was curious. is the 1-10 things describing how high?
---
i pay $220/oz in brooklyn for mid range trees. high grade can run from $300-$450.
---
kudi is the shit. two videos: 1) [crookers remix of day 'n night](http://www.youtube.com/watch?v=wswrepljtkc) i actually heard this before i heard the album version. 2) [kid cudi - marijuana music video](http://www.youtube.com/watch?v=lbejc9g6ktm) this video was directed by shia labeouf apparently. -edit- formatting.
---
sidebar on /r/microgrowery and www.growweedeasy.com seeds at www.seedfinder.eu quantum leds or cobs are the most efficient things going. most other led rigs are not much more efficient than hid. you want temps in the 70s. you need an enclosure to keep it dark 12 hrs a day during flower. if you cannot do that, go autoflower. also a carbon filter. also read. seriously.
---
yeah, i could've phrased it better in my post. but the munchies hit hard on the comedown every time.
---
it's from the blaze glass

### 2.19 Worldnews

In [25]:
subset = train[train['subreddits'] == 'worldnews']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
maybe he's been listening in on their telephone conversations? just a thought.
---
i just wonder how much the power bill would be if we cut all the bad emission producing energy we use to power what we're doing right now. i'd imagine it will be a lot higher.
---
the simplest explanation is that russia controls aspects of the doj and fbi, yes. not all of it entirely. often times it only takes 2-3 people being compromised to pull of shit like this. the cia infiltrates the intelligence services in other countries so it makes sense russia can do it to us. the russians successfully infiltrated our treasury department with dozens of spies. it's not unreasonable to think it happened again. britain just admitted that this routinely happens on their soil. why would america be afraid to admit it?
---
i have an acquaintance who went with her fiancé (both american). he's a rich white idiot who has bought full in to the 'dprk is a communist paradise and the media just won't show you that to kee

### 2.20 Wow

In [26]:
subset = train[train['subreddits'] == 'wow']
for comment in subset['comments'][0:20]:
    print("---")
    print(comment)

---
you are clearly on the shy side. i was too. the problem is that you're expecting other people to go out of their way to include you, but you aren't being proactive and talking to others. it's uncomfortable, but **the best way to have good friends is to be a good friend.** you want people in smaller groups to communicate with you, but you never form groups and you never reach out to other people who might be feeling equally "outsider"ish. you talk about how you wish people in raids would talk to you, but then you don't, because you'd freeze up. the truth is that this is 100% a comfort-zone issue for you. if you keep acting the same way, you are **going to have the same results**. i'm sure you have taken this shyness to be a character trait that you have and can't change. but the truth is you're more complex than that. because now we know that you are content with being shy - however, your shyness is causing other problems that you aren't content with. these problems you're having wi