### This notebook uses the fastai v1 codebase

In [1]:
from fastai.text import *
import pandas as pd
import numpy as np

### Install sklearn package separately, if necessary

In [2]:
# Uncomment below if you need to install scikit-learn (imported as sklearn)
# !pip install scikit-learn

from sklearn.model_selection import train_test_split

### Begin reading in and prepping data

In [3]:
#PATH = "~/Desktop/magellan-ai/bullwinkle/data/has_ads/"
PATH = "storage/sample/"

df_trn = pd.read_csv(f'{PATH}training_sample.csv')

df_tst = pd.read_csv(f'{PATH}holdout_for_Sampling_test.csv')
df_trn.shape, df_tst.shape

((25000, 3), (500, 3))

### Get a list of unique labels (brand_id classes) from both Train/Valid and Test datasets

In [4]:
# keep a list of unique labels that we can use later when testing
labels = df_trn['brand_id'].unique()
labels.sort()

# Do the same for the Test (holdout) data
labels_tst = df_tst['brand_id'].unique()
labels_tst.sort()

labels.shape, labels_tst.shape

((5316,), (382,))

### Check whether any Test labels are missing from the Train/Valid labels

In [5]:
s = set(labels)
t = set(labels_tst)
# Labels present in Test but absent from Train/Valid.
# Always defined (possibly empty) so later cells can rely on it.
labels_tst_unique = sorted(t - s)
if labels_tst_unique:
    print("There are " + str(len(labels_tst_unique)) + " unique labels in Test that are NOT found in Train/Valid")
else:
    print("No unique Test labels")

There are 66 unique labels in Test that are NOT found in Train/Valid
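
The check above counts labels, not rows. A Test row whose label never appears in Train/Valid cannot be classified correctly, so the share of such rows caps the achievable accuracy. The following is a minimal sketch of that calculation on made-up values; `trn_labels` and `tst_labels` are hypothetical stand-ins for the `brand_id` columns, not the real data.

```python
# Hypothetical stand-ins for df_trn['brand_id'] and df_tst['brand_id']
trn_labels = [1, 1, 2, 3]
tst_labels = [1, 2, 4, 5, 5]

# Labels the model could have learned
seen = set(trn_labels)

# Test rows whose label was never seen during training
unseen_rows = [lab for lab in tst_labels if lab not in seen]
unseen_share = len(unseen_rows) / len(tst_labels)

# Only rows with a seen label can be predicted correctly
max_accuracy = 1 - unseen_share
print(f"unseen rows: {len(unseen_rows)} of {len(tst_labels)}, accuracy ceiling: {max_accuracy:.2f}")
```

On the real DataFrames, the same row mask could be obtained with `~df_tst['brand_id'].isin(labels)`.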


### Display some of the Test Data which has Unique labels (not found in TRN Dataset)

In [6]:
pd.options.display.max_colwidth = 0
pd.options.display.html.use_mathjax = False
df_tst[df_tst['brand_id'].isin(labels_tst_unique)]

Unnamed: 0,brand_id,text,id
1,2727,"our good friends at Johnny O Welcome you to this Episode now the iconic Johnny O clothing brand logo of the Surfer and his longboard first caught my eye several years ago, but it's a signature Johnny O style where West Coast meets East Coast prep. That truly changed the game for me, and I've been wearing Johnny O ever since. And now our listeners Khun use promo code Rich Take at checkout for 20 per cent off your first order at johnny dachau dot com. That's 20% off the regular price at johnny dachau dot com. Used the promo code Rich Take at checkout for 20 per cent off your first order.",4619134
14,11740,"Support for this Podcast comes from red bubble. Everyone's got a thing. Maybe you love dinosaurs or you're obsessed with doughnuts or you live and breathe super Weird. True crime shows. This is who you are, and if you want to express It, you should come to read. Bubble Read. Bubble is a marketplace with thousands of artists from around the world who are into dinosaurs, doughnuts and true crime shows to, and they sell T shirts, stickers, masks, pillows, posters and more featuring original designs that celebrate them. So you'll find stuff that you can trust will be perfect because it's the thing you love made by an artist who loves It to red bubble dot com Find your thing,",9929340
17,36095,"Podcast one presents. This is a collect call from Sing Sing. My name is John Lennon. I'm locked up for selling drugs on committing murder. I'm also a contributor project, so I'm a writer on and prisoner, a man trying to stay focused on talk about issues of substance with gates slamming prisoner screaming and pH blaring in the background Get new episodes every Wednesday on Spotify Podcast one and Apple podcasts.",8818521
18,1479,"I'll tell you now right are they sponsor Hanes, since 1901 never heard of us with comforting to they know got these new Comfort Flex fit underwear getting Jason took me 2 years to develop okay they want to do it right now too there's nothing that tells the 3-piece set better you got a breathable pouch I need that W2 pouch a Rougarou it that's it what that's is it and all that bacon next stuff and it's so easy so he can send me more and more everyday I'm running out of options America's number one brand of underwear comfort fit Comfort Flex fit comfortable supportive fantastic underwear I think so",415858
29,15467,Also brought to you by Winter Airport Parking. Fastest way in and out of Philly International Airport and Wilmington University. Experience The Wilm you difference by visiting Wilm you dot e d u.,6163363
48,2318,this serial killers podcast is sponsored by American public media and their podcast in the Dark Spark have podcast but we do think you'll enjoy it and its first season in the dark earned a Peabody for its in-depth study of the Jacob wetterling kidnapping in season 2 L exploring new story with life or death consequences you can find in the dark at Apple podcast Stitcher or wherever you listen to podcasts,560525
67,30359,"Welcome, hot ones game show. It's just hot sauces questions. And eventually, some. Harper didn't know it's possible for your eyeballs to sweat. Shaun Evans hosts Hot ones. The game show a new series on Tru TV with a no holds barred election right around the corner. Take a look back at some of the most hard foot.",7364347
72,19694,support for NPR and the following message come from the Showtime documentary film Hitsville. The Making of Motown Discover how the soul of a city defined the rhythm of a nation in this love letter to the music that took the country by storm hits Bill. The Making of Motown Premiers August 23rd only on Showtime.,5308063
74,19734,"This episode is brought to you by get prime, get prime Help software teams, accelerate their velocity and release products faster by turning historical. Get data into easy to understand insides reports. Because past performance predicts future performance. Get prime can examine your get data, identify bottlenecks, compare sprints of releases over time and enable data driven. Discussion is about engineering and product development. Shift faster because you know more not because you're rushing get started at get prime dot com slash Changelog. That's G I t P r. I am e dot com slash jeans log again get prime dot com slash Changelog.",5742390
80,29908,with panting is rosewater collection. Guy can really feel how much more hydrated my hair is and it sulfate paraben die and mineral oil free. Which makes me feel good because who needs all those additives? Experience something new and discover what's good with the Pantene You treat Blends collection.,7616336


## Begin Classification Prediction against Holdout (Test) Dataset

### Load the fully trained TC Learner that was saved with learner.export() after the last retraining

In [7]:
learn = load_learner(PATH, 'TC_export')

### Select range of rows to include from Test Dataset

In [12]:
R_START = 250
R_END = 500

### Iterate thru Fastai's learn.predict() method for each row in selected Test Dataset

In [13]:
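pred_lst=[]
# Target labels for the selected rows
trg_lst=df_tst.iloc[R_START:R_END]['brand_id']
for t in df_tst.iloc[R_START:R_END]['text']:
    # predict() returns (category, class index, probabilities); keep the category as an int
    pred = learn.predict(t)
    pred_lst.append(int(str(pred[0])))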

### Determine which classification predictions were correct

In [14]:
# True where the prediction matches the target
b = [x == y for x, y in zip(trg_lst, pred_lst)]
res_lst = ['ok' if i else 'XXX' for i in b]
# Fraction of correct predictions
pct = res_lst.count('ok') / len(res_lst)

### Display results

In [11]:
d={'Target': trg_lst, 'Prediction': pred_lst, 'Result':res_lst}
pd.DataFrame(data=d)

Unnamed: 0,Target,Prediction,Result
0,6876,9036,XXX
1,25031,1783,XXX
2,4481,15348,XXX
3,42481,42481,ok
4,5144,5144,ok
5,44,44,ok
6,42757,42757,ok
7,13650,13650,ok
8,42529,42529,ok
9,43986,43986,ok


In [11]:
print('Accuracy: ' + str(pct))

Accuracy: 0.78
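
Because some Test labels never occur in Train/Valid, the single accuracy figure mixes rows the model could have learned with rows it could not. The following is a minimal sketch, on made-up values, that reports the two cases separately; `accuracy_breakdown` and the toy lists are hypothetical and not part of this notebook.

```python
def accuracy_breakdown(targets, preds, known_labels):
    """Return (overall accuracy, accuracy on seen-label rows, number of unseen-label rows)."""
    known = set(known_labels)
    pairs = list(zip(targets, preds))
    # Rows whose target label exists in the training labels
    seen_pairs = [(t, p) for t, p in pairs if t in known]
    overall = sum(t == p for t, p in pairs) / len(pairs)
    seen_acc = sum(t == p for t, p in seen_pairs) / len(seen_pairs) if seen_pairs else 0.0
    return overall, seen_acc, len(pairs) - len(seen_pairs)

# Hypothetical values standing in for trg_lst, pred_lst and labels
toy_targets = [1, 2, 4, 3]
toy_preds = [1, 3, 2, 3]
toy_known = [1, 2, 3]

overall, seen_acc, n_unseen = accuracy_breakdown(toy_targets, toy_preds, toy_known)
print(overall, round(seen_acc, 2), n_unseen)
```

In this notebook the call would look like `accuracy_breakdown(list(trg_lst), pred_lst, labels)`.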


# Section Below can be used for Manual Testing

### Use below if you want to RANDOMLY choose text from the Test Dataset to classify

In [44]:
pd.options.display.max_colwidth = 0
pd.options.display.html.use_mathjax = False

BR_ID = np.random.choice(labels_tst, size=1, replace=False)[0]
    
t = str(df_tst[df_tst['brand_id']==BR_ID].iloc[0].text)

print(BR_ID, t)

28370 Hey, you guys! A Annie's Emmy nominated groundbreaking docuseries. Leah Remini. Scientology in the Aftermath returns for Season two on August 15th with 10 all new episodes. This show was riveting and revealing. I watched all of the first season and ah, second season is coming, as I mentioned August 15th on a Any Leah Remini. Scientology in the aftermath follows Leah Remini along with high level former Scientology executives and church members, as they delve deep into shocking stories of abuse, heartbreak and harassment experienced by those who have left the church and spoken publicly about their experiences. This season, Leah Remini continues her quest to give a voice to victims of the Church of Scientology. The Siri's also explores accounts of former members whose lives have been significantly impacted by the church's practice. Check it out, ladies and gentlemen, Remedy is helping people take action, turning survivors into fighters, revealing truths and seeking justice. She is r

In [45]:
learn.predict(t)

(Category 831,
 tensor(264),
 tensor([8.8164e-09, 1.1244e-07, 3.7135e-06,  ..., 5.3147e-07, 9.4112e-06,
         5.2382e-06]))
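
The tuple above holds the predicted category, its position in the class list, and one score per class. Here the target was 28370 but the prediction is 831, so it can help to look at the runners-up as well. The following is a minimal sketch of a top-k lookup on plain lists; it assumes the third element is a per-class probability vector aligned with the learner's class list (in fastai v1 this is typically `learn.data.classes`). `top_k` and the toy values are hypothetical.

```python
def top_k(probs, classes, k=3):
    """Return the k (class, probability) pairs with the highest probability."""
    ranked = sorted(zip(probs, classes), key=lambda pc: pc[0], reverse=True)
    return [(c, p) for p, c in ranked[:k]]

# Hypothetical class list and probabilities, not real model output
toy_classes = [44, 831, 2331, 28370]
toy_probs = [0.05, 0.60, 0.10, 0.25]

best_two = top_k(toy_probs, toy_classes, k=2)
print(best_two)
```

With a real prediction, the probability tensor would first be converted to a list, for example with `pred[2].tolist()`.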

### Use below if you want to choose a SPECIFIC label from the Test Dataset to classify

In [47]:
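BR_ID = 28370
# Uses the Train data by default; swap in the commented line below to use the Test data.
# Note: str() on a Series also includes the row index (and a Name/dtype footer);
# use .iloc[0].text instead to pass only the raw text of the first matching row.
t = str(df_trn[df_trn['brand_id']==BR_ID].text)
#t = str(df_tst[df_tst['brand_id']==BR_ID].text)
print(BR_ID, t)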

28370 518     A and E's Leah Remini, Scientology and the aftermath is coming back for Season two. And let me tell you, I've met Leah and she is a firecracker and she is honest and she is brave and she does not give a rat's You know what? She just tells the truth. She's she Honestly, I'm a huge fan and I won't be watching and I will fight for this woman. So you knees. Emmy nominated groundbreaking docuseries Leah Remini, Scientology and the Aftermath returns for Season two on August 15th with 10 all new episodes, We all you know, we all want to know. We all want to know what goes on behind those Scientology rolls. Leah Remini, Scientology and the aftermath. Feliz Leah Remini, along with high level former Scientology executives and church members as they delve deep into shocking stories of abuse, heartbreak on harassment experienced by those who have left the church and spoken publicly about their experiences, this bullying and a whole nother. The fact that this is a thing. It's a closed

In [48]:
learn.predict(t)

(Category 28370,
 tensor(1665),
 tensor([2.0339e-08, 2.4228e-06, 3.1566e-05,  ..., 4.2645e-06, 9.2083e-06,
         2.3035e-05]))

### Use below if you want to supply YOUR OWN TEST STRING to run thru the classifier:

In [40]:
t="We canceled all the fraudulent orders that were made on your computer and refunded the money where possible. So I take a look at I get freaked out cause you know, my bank account is linked to that Should So is my credit card. I don't want my Amazon hacked. I log in and I realize there's nothing foul. No one robbed me, but all the orders I meet yesterday disappeared all the orders I made, so I figure, Oh, shit these fuckin dummies Because I've been using a VPN. They felt that I, some random guy locked in from his baker Stan or something into my Amazon account. And they immediately just assumed that I was hacked and without consulting me, decided not just nuke This. Guys account in all of his orders and payments processing because you know what? YouTube. You know, usually websites at least ask a like When you try New logging on Twitter, Twitter will email you saying, Hey, here's a New log in."

In [41]:
learn.predict(t)

(Category 2331,
 tensor(512),
 tensor([7.3616e-06, 3.4632e-09, 3.3857e-05,  ..., 3.1489e-09, 4.5093e-08,
         6.3345e-10]))