<a href="https://colab.research.google.com/github/paruliansaragi/cnn-fastai/blob/master/Lesson3and4Fastai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dropout

precompute=True: pre-compute the activations that come out of the last convolutional layer. Remember, **an activation is a number calculated from the weights/parameters that make up the kernels/filters, applied to the previous layer’s activations or inputs.**

learn 

Sequential(

  (0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)

(1): Dropout(**p=0.5**)

(2): Linear(in_features=1024, out_features=512)

(3): ReLU()

(4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)

(5): Dropout(p=0.5)

(6): Linear(in_features=512, out_features=120)

(7): LogSoftmax()

)

learn — This will display the layers we added at the end. These are the layers we train when precompute=True

(0), (4): BatchNorm will be covered in the last lesson

(1), (5): Dropout

(2):**Linear layer simply means a matrix multiply**. This is a matrix which has **1024 rows and 512 columns, so it will take in 1024 activations and spit out 512 activations.**

(3):ReLU — just replace negatives with zero

(6): Linear — the second linear layer that takes those 512 activations from the previous linear layer and put them through a new matrix multiply 512 by 120 and outputs 120 activations

(7): LogSoftmax: softmax is the activation function that returns numbers that add up to 1, each of them between 0 and 1.

To calculate softmax, we take e to the power of an activation and divide it by the sum of e to the power of every activation in the layer (including that one).

For minor numerical precision reasons, it turns out to be better to take the log of the softmax than the softmax directly [15:03]. That is why, when we get predictions out of our models, we have to do np.exp(log_preds).
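As a rough illustration of softmax and log-softmax, here is a minimal NumPy sketch. The activation values are made up, and the `softmax` and `log_softmax` helpers are hypothetical stand-ins written for this note, not the PyTorch implementations.

```python
import numpy as np

def softmax(x):
    # Subtracting the max does not change the result but avoids overflow in exp.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def log_softmax(x):
    # log(softmax(x)), computed without first forming very small probabilities.
    shifted = x - np.max(x)
    return shifted - np.log(np.exp(shifted).sum())

acts = np.array([2.0, -1.0, 0.5])   # made-up activations for three classes
probs = softmax(acts)
log_preds = log_softmax(acts)

print(probs, probs.sum())           # each value lies in (0, 1) and they sum to 1
print(np.exp(log_preds))            # exponentiating recovers the probabilities
```

The last line mirrors the `np.exp(log_preds)` step used when reading predictions from a model that ends in LogSoftmax.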

What is Dropout and what is p? [08:17]
Dropout(p=0.5)

**If we applied dropout with p=0.5 to Conv2 layer, it would look like the above. We go through, pick an activation, and delete it with 50% chance.** So p=0.5 is the probability of deleting that cell. Output does not actually change by very much, just a little bit.

**It forces the model to not overfit.** In other words, when a particular activation that learned just that exact dog or exact cat gets dropped out, the model has to try and **find a representation that continues to work even as a random half of the activations gets thrown away every time.**

This has been absolutely critical in making modern deep learning work and just about **solve the problem of generalization.** Geoffrey Hinton and his colleagues came up with this idea loosely inspired by the way the brain works.

Have you wondered why the validation losses are better than the training losses, particularly early in training? [12:32] **This is because we turn off dropout when we run inference (i.e. making predictions) on the validation set. We want to be using the best model we can.**

We do not have to do anything ourselves to compensate for the dropped activations; PyTorch does two things when you say p=0.5. It **throws away half of the activations, and it doubles all the activations that remain, so that the average activation does not change.**
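The drop-and-rescale behaviour can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions (the `dropout` helper and the all-ones activations are invented for the example), not PyTorch's actual code.

```python
import numpy as np

def dropout(acts, p, rng, training=True):
    # At inference time dropout is switched off and activations pass through.
    if not training:
        return acts
    # Keep each activation with probability 1 - p ...
    keep = rng.random(acts.shape) >= p
    # ... and rescale the survivors by 1 / (1 - p) so the expected value is unchanged.
    return acts * keep / (1.0 - p)

rng = np.random.default_rng(0)
acts = np.ones(100_000)             # made-up activations, all equal to 1
dropped = dropout(acts, p=0.5, rng=rng)

print(np.unique(dropped))           # survivors are doubled, the rest are zero
print(dropped.mean())               # stays close to the original mean of 1.0
```

With `training=False` the function returns its input untouched, which is why validation loss can look better than training loss early on.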

**In Fast.ai, you can pass in ps which is the p value for all of the added layers.** It will not change the dropout in the pre-trained network since it should have been already trained with some appropriate level of dropout:



In [0]:
learn = ConvLearner.pretrained(arch, data, ps=0.5, precompute=True)


You may have noticed that it has been adding two Linear layers [16:19]. We do not have to do that. There is an xtra_fc parameter you can set. Note: you do need at least one Linear layer, which takes the output of the convolutional layers (4096 in this example) and turns it into the number of classes (120 dog breeds):

In [0]:
learn = ConvLearner.pretrained(arch, data, ps=0., precompute=True, 
            xtra_fc=[]); learn 

Sequential(

  (0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)

(1): Linear(in_features=1024, out_features=120)

(2): LogSoftmax()

)

In [0]:
learn = ConvLearner.pretrained(arch, data, ps=0., precompute=True, 
            xtra_fc=[700, 300]); learn
            
            
Sequential(

  (0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)

(1): Linear(in_features=1024, out_features=**700**)

(2): ReLU()

(3): BatchNorm1d(700, eps=1e-05, momentum=0.1, affine=True)

(4): Linear(in_features=700, out_features=**300**)

(5): ReLU()

(6): BatchNorm1d(300, eps=1e-05, momentum=0.1, affine=True)

(7): Linear(in_features=300, out_features=120)

(8): LogSoftmax()

)

Question: Is there a particular way in which you can determine if it is overfitted? [19:53] Yes, you can see that the **training loss is much lower than the validation loss.** You cannot tell whether it is too overfitted, though. Zero overfitting is not generally optimal.

If in doubt, use the same dropout for every fully connected layer.
There is no established intuition for setting different ps for earlier/later layers.

In [0]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [0]:
!pip install fastai==0.7.*

Collecting fastai==0.7.*
  Downloading https://files.pythonhosted.org/packages/50/6d/9d0d6e17a78b0598d5e8c49a0d03ffc7ff265ae62eca3e2345fab14edb9b/fastai-0.7.0-py3-none-any.whl (112kB)

In [0]:
!pip install torchtext==0.2.3

Collecting torchtext==0.2.3
  Downloading https://files.pythonhosted.org/packages/78/90/474d5944d43001a6e72b9aaed5c3e4f77516fbef2317002da2096fd8b5ea/torchtext-0.2.3.tar.gz (42kB)
Building wheels for collected packages: torchtext
  Running setup.py bdist_wheel for torchtext ... done
  Stored in directory: /root/.cache/pip/wheels/42/a6/f4/b267328bde6bb680094a0c173e8e5627ccc99543abded97204
Successfully built torchtext
Installing collected packages: torchtext
  Found existing installation: torchtext 0.3.1
    Uninstalling torchtext-0.3.1:
      Successfully uninstalled torchtext-0.3.1
Successfully installed torchtext-0.2.3


In [0]:

from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)

PATH='./'

In [0]:
!wget http://files.fast.ai/part2/lesson14/rossmann.tgz

--2018-10-14 08:07:26--  http://files.fast.ai/part2/lesson14/rossmann.tgz
Resolving files.fast.ai (files.fast.ai)... 67.205.15.147
Connecting to files.fast.ai (files.fast.ai)|67.205.15.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7730448 (7.4M) [text/plain]
Saving to: ‘rossmann.tgz’


2018-10-14 08:07:26 (11.6 MB/s) - ‘rossmann.tgz’ saved [7730448/7730448]



In [0]:
!tar xvfz rossmann.tgz

googletrend.csv
sample_submission.csv
state_names.csv
store.csv
store_states.csv
test.csv
train.csv
weather.csv


In [0]:
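import os
from glob import glob

def concat_csvs(dirname):
    # Concatenate every CSV in PATH/dirname into PATH/dirname.csv, adding a
    # leading "file" column that holds the source file's base name.
    path = f'{PATH}{dirname}'
    filenames=glob(f"{path}/*.csv")  # look inside the directory, not PATH itself

    wrote_header = False
    with open(f"{path}.csv","w") as outputfile:
        for filename in filenames:
            # base name without directory or extension, e.g. "Rossmann_DE_SN"
            name = os.path.splitext(os.path.basename(filename))[0]
            with open(filename) as f:
                line = f.readline()
                if not wrote_header:
                    # write the header row once, taken from the first file
                    wrote_header = True
                    outputfile.write("file,"+line)
                for line in f:
                     outputfile.write(name + "," + line)
                outputfile.write("\n")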

In [0]:
# concat_csvs('googletrend')
# concat_csvs('weather')

In [0]:
table_names = ['train', 'store', 'store_states', 'state_names', 
               'googletrend', 'weather', 'test']

In [0]:
tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]


In [0]:
from IPython.display import HTML, display


There are two types of columns:

- Categorical: it has a number of “levels”, e.g. StoreType, Assortment
- Continuous: it holds a number where differences or ratios between values are meaningful, e.g. CompetitionDistance

In [0]:
for t in tables: display(t.head())


Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,0,1
1,2,5,2015-07-31,6064,625,1,1,0,1
2,3,5,2015-07-31,8314,821,1,1,0,1
3,4,5,2015-07-31,13995,1498,1,1,0,1
4,5,5,2015-07-31,4822,559,1,1,0,1


Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,c,a,1270.0,9.0,2008.0,0,,,
1,2,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,c,c,620.0,9.0,2009.0,0,,,
4,5,a,a,29910.0,4.0,2015.0,0,,,


Unnamed: 0,Store,State
0,1,HE
1,2,TH
2,3,NW
3,4,BE
4,5,SN


Unnamed: 0,StateName,State
0,BadenWuerttemberg,BW
1,Bayern,BY
2,Berlin,BE
3,Brandenburg,BB
4,Bremen,HB


Unnamed: 0,file,week,trend
0,Rossmann_DE_SN,2012-12-02 - 2012-12-08,96
1,Rossmann_DE_SN,2012-12-09 - 2012-12-15,95
2,Rossmann_DE_SN,2012-12-16 - 2012-12-22,91
3,Rossmann_DE_SN,2012-12-23 - 2012-12-29,48
4,Rossmann_DE_SN,2012-12-30 - 2013-01-05,67


Unnamed: 0,file,Date,Max_TemperatureC,Mean_TemperatureC,Min_TemperatureC,Dew_PointC,MeanDew_PointC,Min_DewpointC,Max_Humidity,Mean_Humidity,...,Max_VisibilityKm,Mean_VisibilityKm,Min_VisibilitykM,Max_Wind_SpeedKm_h,Mean_Wind_SpeedKm_h,Max_Gust_SpeedKm_h,Precipitationmm,CloudCover,Events,WindDirDegrees
0,NordrheinWestfalen,2013-01-01,8,4,2,7,5,1,94,87,...,31.0,12.0,4.0,39,26,58.0,5.08,6.0,Rain,215
1,NordrheinWestfalen,2013-01-02,7,4,1,5,3,2,93,85,...,31.0,14.0,10.0,24,16,,0.0,6.0,Rain,225
2,NordrheinWestfalen,2013-01-03,11,8,6,10,8,4,100,93,...,31.0,8.0,2.0,26,21,,1.02,7.0,Rain,240
3,NordrheinWestfalen,2013-01-04,9,9,8,9,9,8,100,94,...,11.0,5.0,2.0,23,14,,0.25,7.0,Rain,263
4,NordrheinWestfalen,2013-01-05,8,8,7,8,7,6,100,94,...,10.0,6.0,3.0,16,10,,0.0,7.0,Rain,268


Unnamed: 0,Id,Store,DayOfWeek,Date,Open,Promo,StateHoliday,SchoolHoliday
0,1,1,4,2015-09-17,1.0,1,0,0
1,2,3,4,2015-09-17,1.0,1,0,0
2,3,7,4,2015-09-17,1.0,1,0,0
3,4,8,4,2015-09-17,1.0,1,0,0
4,5,9,4,2015-09-17,1.0,1,0,0


In [0]:
train, store, store_states, state_names, googletrend, weather, test = tables


In [0]:
len(train),len(test)

(1017209, 41088)

In [0]:
train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday = test.StateHoliday!='0'

In [0]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))

In [0]:
weather = join_df(weather, state_names, "file", "StateName")


In [0]:
googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]
googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', "State"] = 'HB,NI'

There is a Fast.ai function called add_datepart which takes a data frame and a column name. It optionally removes the column from the data frame and replaces it with many columns representing all of the useful information about that date, such as day of week, day of month, month of year, etc. (basically everything Pandas gives us).
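To give a feel for what such a date expansion does, here is a small pandas sketch. The `add_date_parts` helper and its column names are a hypothetical, much-reduced stand-in written for this note; it is not the fastai implementation and produces far fewer columns.

```python
import pandas as pd

def add_date_parts(df, col, drop=True):
    # Hypothetical simplified stand-in for fastai's add_datepart.
    dates = pd.to_datetime(df[col])
    df['Year'] = dates.dt.year
    df['Month'] = dates.dt.month
    df['Day'] = dates.dt.day
    df['Dayofweek'] = dates.dt.dayofweek        # pandas convention: Monday=0 ... Sunday=6
    df['Is_month_end'] = dates.dt.is_month_end
    if drop:
        # optionally remove the original date column
        df.drop(columns=[col], inplace=True)

demo = pd.DataFrame({'Date': ['2015-07-31', '2015-08-01']})
add_date_parts(demo, 'Date', drop=False)
print(demo)
```

Passing `drop=False`, as the notebook does, keeps the original date column so it can still be used as a join key.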

In [0]:
add_datepart(weather, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)

In [0]:
trend_de = googletrend[googletrend.file == 'Rossmann_DE']


In [0]:
store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])

0

In [0]:
joined = join_df(train, store, "Store")
joined_test = join_df(test, store, "Store")
len(joined[joined.StoreType.isnull()]),len(joined_test[joined_test.StoreType.isnull()])

(0, 0)

In [0]:
joined = join_df(joined, googletrend, ["State","Year", "Week"])
joined_test = join_df(joined_test, googletrend, ["State","Year", "Week"])
len(joined[joined.trend.isnull()]),len(joined_test[joined_test.trend.isnull()])

(0, 0)

In [0]:
joined = joined.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
joined_test = joined_test.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
len(joined[joined.trend_DE.isnull()]),len(joined_test[joined_test.trend_DE.isnull()])

(0, 0)

In [0]:
joined = join_df(joined, weather, ["State","Date"])
joined_test = join_df(joined_test, weather, ["State","Date"])
len(joined[joined.Mean_TemperatureC.isnull()]),len(joined_test[joined_test.Mean_TemperatureC.isnull()])

(0, 0)

In [0]:
for df in (joined, joined_test):
    for c in df.columns:
        if c.endswith('_y'):
            if c in df.columns: df.drop(c, inplace=True, axis=1)

In [0]:

for df in (joined,joined_test):
    df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
    df['CompetitionOpenSinceMonth'] = df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
    df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)
    df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)

In [0]:

for df in (joined,joined_test):
    df["CompetitionOpenSince"] = pd.to_datetime(dict(year=df.CompetitionOpenSinceYear, 
                                                     month=df.CompetitionOpenSinceMonth, day=15))
    df["CompetitionDaysOpen"] = df.Date.subtract(df.CompetitionOpenSince).dt.days

In [0]:
for df in (joined,joined_test):
    df.loc[df.CompetitionDaysOpen<0, "CompetitionDaysOpen"] = 0
    df.loc[df.CompetitionOpenSinceYear<1990, "CompetitionDaysOpen"] = 0

In [0]:
for df in (joined,joined_test):
    df["CompetitionMonthsOpen"] = df["CompetitionDaysOpen"]//30
    df.loc[df.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24
joined.CompetitionMonthsOpen.unique()

array([24,  3, 19,  9,  0, 16, 17,  7, 15, 22, 11, 13,  2, 23, 12,  4, 10,  1, 14, 20,  8, 18,  6, 21,  5])

In [0]:

for df in (joined,joined_test):
    df["Promo2Since"] = pd.to_datetime(df.apply(lambda x: Week(
        x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1).astype(pd.datetime))
    df["Promo2Days"] = df.Date.subtract(df["Promo2Since"]).dt.days

In [0]:
for df in (joined,joined_test):
    df.loc[df.Promo2Days<0, "Promo2Days"] = 0
    df.loc[df.Promo2SinceYear<1990, "Promo2Days"] = 0
    df["Promo2Weeks"] = df["Promo2Days"]//7
    df.loc[df.Promo2Weeks<0, "Promo2Weeks"] = 0
    df.loc[df.Promo2Weeks>25, "Promo2Weeks"] = 25
    df.Promo2Weeks.unique()

In [0]:
joined.to_feather(f'{PATH}joined')
joined_test.to_feather(f'{PATH}joined_test')

In [0]:
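def get_elapsed(fld, pre):
    # Works on the global df, which must already be sorted by Store and then Date.
    # For each row, stores in df[pre+fld] the signed number of days since the most
    # recently visited row of the same store where fld was truthy (NaN until one is seen).
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()
    last_store = 0
    res = []

    for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):
        if s != last_store:
            # new store: forget the previous store's last event
            last_date = np.datetime64()
            last_store = s
        if v: last_date = d
        res.append(((d-last_date).astype('timedelta64[D]') / day1))
    df[pre+fld] = res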

In [0]:
columns = ["Date", "Store", "Promo", "StateHoliday", "SchoolHoliday"]


In [0]:
df = train[columns].append(test[columns])


In [0]:
fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

In [0]:
fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

In [0]:
fld = 'Promo'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

In [0]:
df = df.set_index("Date")


In [0]:
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']


In [0]:
for o in ['Before', 'After']:
    for p in columns:
        a = o+p
        df[a] = df[a].fillna(0).astype(int)

In [0]:
bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()


In [0]:
fwd = df[['Store']+columns].sort_index(ascending=False
                                      ).groupby("Store").rolling(7, min_periods=1).sum()

In [0]:
bwd.drop('Store',1,inplace=True)
bwd.reset_index(inplace=True)

In [0]:
fwd.drop('Store',1,inplace=True)
fwd.reset_index(inplace=True)

In [0]:
df.reset_index(inplace=True)


In [0]:
df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])
df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])

In [0]:
df.drop(columns,1,inplace=True)


In [0]:
df.head()


Unnamed: 0,Date,Store,AfterSchoolHoliday,BeforeSchoolHoliday,AfterStateHoliday,BeforeStateHoliday,AfterPromo,BeforePromo,SchoolHoliday_bw,StateHoliday_bw,Promo_bw,SchoolHoliday_fw,StateHoliday_fw,Promo_fw
0,2015-09-17,1,13,0,105,0,0,0,0.0,0.0,4.0,0.0,0.0,1.0
1,2015-09-16,1,12,0,104,0,0,0,0.0,0.0,3.0,0.0,0.0,2.0
2,2015-09-15,1,11,0,103,0,0,0,0.0,0.0,2.0,0.0,0.0,3.0
3,2015-09-14,1,10,0,102,0,0,0,0.0,0.0,1.0,0.0,0.0,4.0
4,2015-09-13,1,9,0,101,0,9,-1,0.0,0.0,0.0,0.0,0.0,4.0


In [0]:
df.to_feather(f'{PATH}df')


In [0]:
df = pd.read_feather(f'{PATH}df')


In [0]:
df["Date"] = pd.to_datetime(df.Date)


In [0]:
df.columns

Index(['Date', 'Store', 'AfterSchoolHoliday', 'BeforeSchoolHoliday',
       'AfterStateHoliday', 'BeforeStateHoliday', 'AfterPromo', 'BeforePromo',
       'SchoolHoliday_bw', 'StateHoliday_bw', 'Promo_bw', 'SchoolHoliday_fw',
       'StateHoliday_fw', 'Promo_fw'],
      dtype='object')

In [0]:
joined = join_df(joined, df, ['Store', 'Date'])


In [0]:
joined_test = join_df(joined_test, df, ['Store', 'Date'])


In [0]:
joined = joined[joined.Sales!=0]


In [0]:
joined.reset_index(inplace=True)
joined_test.reset_index(inplace=True)

In [0]:
joined.to_feather(f'{PATH}joined')
joined_test.to_feather(f'{PATH}joined_test')

In [0]:
joined = pd.read_feather(f'{PATH}joined')
joined_test = pd.read_feather(f'{PATH}joined_test')

In [0]:
joined.head().T.head(40)


Unnamed: 0,0,1,2,3,4
index,0,1,2,3,4
Store,1,2,3,4,5
DayOfWeek,5,5,5,5,5
Date,2015-07-31 00:00:00,2015-07-31 00:00:00,2015-07-31 00:00:00,2015-07-31 00:00:00,2015-07-31 00:00:00
Sales,5263,6064,8314,13995,4822
Customers,555,625,821,1498,559
Open,1,1,1,1,1
Promo,1,1,1,1,1
StateHoliday,False,False,False,False,False
SchoolHoliday,1,1,1,1,1


Now that we've engineered all our features, we need to convert to input compatible with a neural network.

This includes converting categorical variables into contiguous integers or one-hot encodings, normalizing continuous features to standard normal, etc...

Numbers like Year and Month could be treated as continuous, but we do not have to. If we make Year a categorical variable, we are telling our neural net that it can treat every “level” of Year (2000, 2001, 2002) totally differently, whereas if we say it is continuous, the net has to come up with some kind of smooth function to fit the values. So for things that are actually continuous but do not have many distinct levels (e.g. Year, DayOfWeek), it often works better to treat them as categorical.
Choosing categorical vs. continuous is a modeling decision you get to make. In summary, if a variable is categorical in the data, it has to be categorical in the model. If it is continuous in the data, you get to pick whether to make it continuous or categorical in the model.
Generally, floating point numbers are hard to make categorical as there are many levels (we call the number of levels the “cardinality”; e.g. the cardinality of the day of week variable is 7).
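A quick way to see the difference in cardinality is to count distinct values per column. The sketch below uses a small made-up frame (the column names follow the Rossmann data, but the values are only illustrative).

```python
import pandas as pd

demo = pd.DataFrame({
    'DayOfWeek': [1, 2, 3, 4, 5, 6, 7, 1, 2],
    'CompetitionDistance': [1270.0, 570.0, 14130.0, 620.0, 29910.0,
                            310.0, 24000.0, 7520.0, 2030.0],
})

# Cardinality = number of distinct levels in each column.
print(demo.nunique())

# Treating DayOfWeek as categorical: each level is mapped to an integer code.
day_cat = demo['DayOfWeek'].astype('category').cat.as_ordered()
print(day_cat.cat.codes.tolist())
```

DayOfWeek never exceeds 7 levels however many rows are added, while a floating point column such as CompetitionDistance tends to gain a new level with almost every row.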

In [0]:
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
    'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
    'SchoolHoliday_fw', 'SchoolHoliday_bw']

contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
   'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
   'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
   'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']

n = len(joined); n

844338

In [0]:
dep = 'Sales'
joined = joined[cat_vars+contin_vars+[dep, 'Date']].copy()

In [0]:

joined_test[dep] = 0
joined_test = joined_test[cat_vars+contin_vars+[dep, 'Date', 'Id']].copy()

In [0]:
for v in cat_vars: joined[v] = joined[v].astype('category').cat.as_ordered()


In [0]:
apply_cats(joined_test, joined)


Loop through cat_vars and turn applicable data frame columns into categorical columns.
Loop through contin_vars and set them as float32 (32 bit floating point) because that is what PyTorch expects.

In [0]:
for v in contin_vars:
    joined[v] = joined[v].fillna(0).astype('float32')
    joined_test[v] = joined_test[v].fillna(0).astype('float32')

In [0]:
#Start with a small sample
idxs = get_cv_idxs(n, val_pct=150000/n)
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp); samp_size

150000

In [0]:
samp_size = n
joined_samp = joined.set_index("Date")

In [0]:
joined_samp.head(2)


Date,Store,DayOfWeek,Year,Month,Day,StateHoliday,CompetitionMonthsOpen,Promo2Weeks,StoreType,Assortment,...,Max_Wind_SpeedKm_h,Mean_Wind_SpeedKm_h,CloudCover,trend,trend_DE,AfterStateHoliday,BeforeStateHoliday,Promo,SchoolHoliday,Sales
2015-07-31,1,5,2015,7,31,False,24,0,c,a,...,24.0,11.0,1.0,85.0,83.0,57.0,0.0,1.0,1.0,5263
2015-07-31,2,5,2015,7,31,False,24,25,a,a,...,14.0,11.0,4.0,80.0,83.0,67.0,0.0,1.0,1.0,6064


proc_df (process data frame) — A function in Fast.ai that does a few things:

1. **Pulls out the dependent variable, puts it into a separate variable**, and **deletes it from the original data frame.** *In other words, df does not have the Sales column, and y contains only the Sales column.*
2. **do_scale:** **Neural nets really like to have the input data to all be somewhere around zero with a standard deviation of somewhere around 1. So we take our data, subtract the mean, and divide by the standard deviation to make that happen.** It returns a special object (mapper) which keeps track of the mean and standard deviation it used for that normalization, so you can apply the same scaling to the test set later.

3. It also **handles missing values. For a categorical variable,** a missing value becomes ID 0 and the other categories become 1, 2, 3, and so on. **For a continuous variable, it replaces the missing value with the median and creates a new boolean column that says whether it was missing or not.**
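The three steps can be sketched for continuous columns with plain pandas. `simple_proc` is a hypothetical, heavily reduced stand-in invented for this note; the real proc_df also handles categorical columns and uses its own mapper object, and the `_na` column suffix here is an assumption.

```python
import numpy as np
import pandas as pd

def simple_proc(df, y_fld, cont_flds):
    # Hypothetical, much-reduced stand-in for proc_df (continuous columns only).
    df = df.copy()
    y = df.pop(y_fld).values                  # step 1: split off the dependent variable
    stats = {}
    for c in cont_flds:
        if df[c].isnull().any():              # step 3: flag missing values, fill with the median
            df[c + '_na'] = df[c].isnull()
            df[c] = df[c].fillna(df[c].median())
        mean, std = df[c].mean(), df[c].std()
        stats[c] = (mean, std)                # remembered so the test set can reuse them
        df[c] = (df[c] - mean) / std          # step 2: scale to mean 0, std 1
    return df, y, stats

raw = pd.DataFrame({'CompetitionDistance': [1270.0, 570.0, np.nan, 620.0],
                    'Sales': [5263, 6064, 8314, 13995]})
x, y, stats = simple_proc(raw, 'Sales', ['CompetitionDistance'])
print(x)
print(y, stats)
```

Keeping `stats` around plays the role of the mapper: the test set should be scaled with the training mean and standard deviation, not its own.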


In [0]:
df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)

In [0]:
joined_test = joined_test.set_index("Date")


In [0]:
df_test, _, nas, mapper = proc_df(joined_test, 'Sales', do_scale=True, skip_flds=['Id'],
                                  mapper=mapper, na_dict=nas)

In [0]:
df.head(10)


Date,Store,DayOfWeek,Year,Month,Day,StateHoliday,CompetitionMonthsOpen,Promo2Weeks,StoreType,Assortment,...,Min_Humidity,Max_Wind_SpeedKm_h,Mean_Wind_SpeedKm_h,CloudCover,trend,trend_DE,AfterStateHoliday,BeforeStateHoliday,Promo,SchoolHoliday
2015-07-31,1,5,3,7,31,1,25,1,3,1,...,-1.620066,0.149027,-0.142774,-1.844823,1.732492,1.724334,0.604461,1.13112,1.113717,2.04105
2015-07-31,2,5,3,7,31,1,25,26,1,1,...,-1.264031,-0.960613,-0.142774,-0.488722,1.294578,1.724334,0.926957,1.13112,1.113717,2.04105
2015-07-31,3,5,3,7,31,1,25,26,1,1,...,-1.314893,-0.960613,-1.154031,-1.392789,1.820074,1.724334,0.604461,1.13112,1.113717,2.04105
2015-07-31,4,5,3,7,31,1,25,1,3,3,...,-1.009721,0.038063,0.699941,0.415345,0.769081,1.724334,0.926957,1.13112,1.113717,2.04105
2015-07-31,5,5,3,7,31,1,4,1,1,1,...,-1.213169,-0.960613,-0.142774,-0.488722,1.469743,1.724334,0.604461,1.13112,1.113717,2.04105
2015-07-31,6,5,3,7,31,1,20,1,1,1,...,-1.213169,-0.960613,-0.142774,-0.488722,1.469743,1.724334,0.604461,1.13112,1.113717,2.04105
2015-07-31,7,5,3,7,31,1,25,1,1,3,...,-0.857134,0.370955,0.194312,0.415345,1.031829,1.724334,0.926957,1.13112,1.113717,2.04105
2015-07-31,8,5,3,7,31,1,10,1,1,1,...,-0.857134,0.370955,0.194312,0.415345,1.031829,1.724334,0.926957,1.13112,1.113717,2.04105
2015-07-31,9,5,3,7,31,1,25,1,1,3,...,-1.314893,-0.960613,-1.154031,-1.392789,1.820074,1.724334,0.604461,1.13112,1.113717,2.04105
2015-07-31,10,5,3,7,31,1,25,1,1,1,...,-1.009721,-0.183865,0.194312,-0.036688,1.294578,1.724334,0.926957,1.13112,1.113717,2.04105


In time series data, cross-validation is not random. Instead, our holdout data is generally the most recent data, as it would be in real application. This issue is discussed in detail in this post on our web site.

One approach is to take the last 25% of rows (sorted by date) as our validation set.

In [0]:
train_ratio = 0.75
# train_ratio = 0.9
train_size = int(samp_size * train_ratio); train_size
val_idx = list(range(train_size, len(df)))

An even better option for picking a validation set is using the exact same length of time period as the test set uses - this is implemented here:

In [0]:
val_idx = np.flatnonzero(
    (df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1)))

In [0]:
val_idx=[0]


For any Kaggle competitions, it is important that you have a strong understanding of your metric — how you are going to be judged. In this competition, we are going to be judged on Root Mean Square Percentage Error (RMSPE).

![alt text](https://cdn-images-1.medium.com/max/1200/1*a7mJ5VCeuAxagGrHOq6ekQ.png)
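Written out in symbols, with $y_i$ the actual sales for row $i$, $\hat{y}_i$ the prediction, and $n$ the number of rows (the symbol names are my own choice for this note), the metric is:

$$
\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^{2}}
$$

This is the same quantity that exp_rmspe computes once predictions and targets have been taken back out of log space.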

In [0]:
def inv_y(a): return np.exp(a)

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return math.sqrt((pct_var**2).mean())

max_log_y = np.max(yl)
y_range = (0, max_log_y*1.2)

When you take the log of the data, the root mean squared error on the logs is approximately the root mean square percentage error on the original values (as long as the relative errors are small).
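A quick numerical check of this relationship, using made-up sales figures and predictions that are off by roughly 5% (all values here are synthetic and only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
targ = rng.uniform(1000, 10000, size=1000)            # made-up sales figures
pred = targ * (1 + rng.normal(0, 0.05, size=1000))    # predictions off by about 5%

# RMSE measured on the log scale ...
log_rmse = np.sqrt(np.mean((np.log(pred) - np.log(targ)) ** 2))
# ... versus RMSPE measured on the original scale.
rmspe = np.sqrt(np.mean(((targ - pred) / targ) ** 2))

print(log_rmse, rmspe)   # the two values are close when relative errors are small
```

The agreement comes from log(1 + e) being close to e for small e, so it degrades as the relative errors grow.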

In [0]:
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128,
                                       test_df=df_test)

- As usual, we start by creating a model data object, which has a validation set, a training set, and an optional test set built into it. From that we get a learner, optionally call lr_find, then call learn.fit and so forth.
- The difference here is that we are not using ImageClassifierData.from_csv or .from_paths; we need a different kind of model data called ColumnarModelData, and we call from_data_frame.
- PATH : Specifies where to store model files etc
- val_idx : A list of the indexes of the rows that we want to put in the validation set
- df : data frame that contains the independent variables
- yl : We took the dependent variable y returned by proc_df and took the log of that (i.e. np.log(y))
- cat_flds : which columns to be treated as categorical. Remember, by this time, everything is a number, so unless we specify, it will treat them all as continuous.

For our continuous variables:

The neural net takes the vector of continuous variables and puts it through a matrix multiply. We decide how many columns the weight matrix has (say 100), so it spits out a new rank 1 tensor of length 100. That goes through a ReLU and then another matrix multiply.

![alt text](https://cdn-images-1.medium.com/max/1200/1*T604NRtHHBkBWFvWoovlUw.png)

For regression, we do not need a softmax at the end.
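The path taken by the continuous variables can be sketched with NumPy. The weights are random and the sizes (16 inputs, 100 hidden activations, 1 output) are assumptions chosen for illustration; a real model would learn the weights and include bias terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cont, n_hidden = 16, 100                  # assumed sizes for this sketch

x = rng.normal(size=n_cont)                 # one row of normalized continuous variables
w1 = rng.normal(size=(n_cont, n_hidden))    # first weight matrix: 16 rows, 100 columns
w2 = rng.normal(size=(n_hidden, 1))         # second weight matrix: 100 rows, 1 column

h = np.maximum(x @ w1, 0)                   # matrix multiply, then ReLU
out = h @ w2                                # final matrix multiply, no softmax

print(h.shape, out.shape)
```

The number of columns in each weight matrix is what sets the length of the vector coming out of that layer.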

## What do we do about categorical variables?
Categorical variables [50:49]
**We create a new matrix of 7 rows and as many columns** as we choose (4, for example) and **fill it with floating point numbers**. **To add “Sunday” to our rank 1 tensor with continuous variables, we do a lookup into this matrix, which returns 4 floating point numbers, and we use them as “Sunday”.** This is an embedding matrix.

![alt text](https://cdn-images-1.medium.com/max/1200/1*cAgCy5HfD0rvPDg2dQITeg.png)

**Initially, these numbers are random. But we can put them through a neural net and update them in a way that reduces the loss.** Do gradient descent and update the embedding matrix to improve the weights of the matrix so it is less random and find the best weights for each day of the week. In other words, this **matrix is just another bunch of weights** in our neural net. And matrices of this type are called “embedding matrices”. **An embedding matrix is something where we start out with an integer between zero and maximum number of levels of that category.** We index into the matrix to find a particular row, and we append it to all of our continuous variables, and everything after that is just the same as before (linear → ReLU → etc).
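Here is a minimal sketch of the lookup-and-append step. The matrix is random (as it would be before training), the 0=Monday ... 6=Sunday encoding is an assumption, and the continuous values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(7, 4))         # 7 days of the week, 4 numbers per day

day_idx = 6                           # assumed encoding: 0=Monday ... 6=Sunday
cont = np.array([0.3, -1.2, 0.8])     # made-up continuous variables for one row

day_vec = emb[day_idx]                # the lookup: pick out one row of the matrix
x = np.concatenate([day_vec, cont])   # append the embedding to the continuous inputs

print(day_vec.shape, x.shape)
```

From here `x` is fed through the linear and ReLU layers exactly like any other input vector, and gradient descent updates the rows of `emb` along with the other weights.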

In [0]:
cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]
#made a list of every categorical variable and its cardinality
#cardinality - the number of elements in a set or other grouping, as a property of that grouping.
cat_sz

[('Store', 1116),
 ('DayOfWeek', 8),
 ('Year', 4),
 ('Month', 13),
 ('Day', 32),
 ('StateHoliday', 3),
 ('CompetitionMonthsOpen', 26),
 ('Promo2Weeks', 27),
 ('StoreType', 5),
 ('Assortment', 4),
 ('PromoInterval', 4),
 ('CompetitionOpenSinceYear', 24),
 ('Promo2SinceYear', 9),
 ('State', 13),
 ('Week', 53),
 ('Events', 22),
 ('Promo_fw', 7),
 ('Promo_bw', 7),
 ('StateHoliday_fw', 4),
 ('StateHoliday_bw', 4),
 ('SchoolHoliday_fw', 9),
 ('SchoolHoliday_bw', 9)]

- Here is a list of every categorical variable and its cardinality.
- Even if there were no missing values in the original data, you should still set aside one for unknown just in case.
- The rule of thumb for determining the embedding size is the cardinality size divided by 2, but no bigger than 50.


We use the cardinality of each variable (that is, its number of unique values) to decide how large to make its embeddings. Each level will be associated with a vector with length defined as below.

In [0]:
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
#take the cardinality of the variable divide it by 2 but dont make it bigger than 50
#and that is the heuristic behind how many columns in the embedding matrix
#By having higher dimensionality vector rather than just a single number, 
#it gives the deep learning network a chance to learn these rich representations.
emb_szs

[(1116, 50),
 (8, 4),
 (4, 2),
 (13, 7),
 (32, 16),
 (3, 2),
 (26, 13),
 (27, 14),
 (5, 3),
 (4, 2),
 (4, 2),
 (24, 12),
 (9, 5),
 (13, 7),
 (53, 27),
 (22, 11),
 (7, 4),
 (7, 4),
 (4, 2),
 (4, 2),
 (9, 5),
 (9, 5)]

**Looking up an embedding with an index is identical to doing a matrix product between a one-hot encoded vector and the embedding matrix.** But doing so is terribly inefficient, so modern libraries implement this as taking an **integer** and **doing a look up into an array**.
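The equivalence is easy to check numerically. This sketch assumes an 8 x 4 matrix of random numbers and an arbitrary index; it illustrates the idea rather than any particular library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 4))    # e.g. an 8 x 4 embedding matrix

idx = 5
one_hot = np.zeros(8)
one_hot[idx] = 1.0               # one-hot encoding of level 5

via_matmul = one_hot @ emb       # multiplies by zero almost everywhere
via_lookup = emb[idx]            # plain array indexing

print(np.allclose(via_matmul, via_lookup))
```

The matrix product touches every row of the matrix only to discard all but one, which is the inefficiency that the direct lookup avoids.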

![alt text](https://cdn-images-1.medium.com/max/1200/1*psxpwtr5bw55lKxVV_y81w.png)
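The equivalence between a one-hot matrix product and a row lookup can be checked by hand. The following is a minimal sketch in plain Python, not the fastai or PyTorch implementation; the names `emb`, `idx`, `via_matmul`, and `via_lookup` are made up for illustration, and the 8 by 4 shape mirrors the DayOfWeek embedding.

```python
import random

random.seed(0)
n_levels, emb_dim = 8, 4  # e.g. DayOfWeek: 8 rows by 4 columns

# Embedding matrix: one row of emb_dim numbers per category level
emb = [[random.gauss(0, 1) for _ in range(emb_dim)] for _ in range(n_levels)]

idx = 3  # integer code of the category we want to embed

# Route 1: one-hot encoded vector times the embedding matrix
one_hot = [1.0 if i == idx else 0.0 for i in range(n_levels)]
via_matmul = [sum(one_hot[i] * emb[i][j] for i in range(n_levels))
              for j in range(emb_dim)]

# Route 2: plain row lookup with the integer index
via_lookup = emb[idx]

same = all(abs(a - b) < 1e-12 for a, b in zip(via_matmul, via_lookup))
print(same)
```

The matrix product touches every row of the matrix only to multiply most of them by zero, which is why libraries store the integer and index into the array instead.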

- Here, we are asking it to create a learner that is suitable for our model data.
- 0.04 : how much dropout to use on the embeddings
- [1000,500] : how many activations to have in each linear layer
- [0.001,0.01] : how much dropout to use in each of those later layers

So for example, day of week now becomes an embedding matrix with eight rows and four columns. Conceptually this allows our model to create some interesting time series models. If there is something that has a seven day period cycle that goes up on Mondays and down on Wednesdays but only for dairy and only in Berlin, it can totally do that; it has all the information it needs. This is a fantastic way to deal with time series. You just need to make sure that the cycle indicator in your time series exists as a column. If you did not have a column called day of week, it would be very difficult for the neural network to learn to do mod seven and look up in an embedding matrix. It is not impossible but really hard. If you are predicting sales of beverages in San Francisco, you probably want a list of when the ball game is on at AT&T Park because that is going to impact how many people are drinking beer in SoMa. So you need to make sure that the basic indicators of periodicity are in your data, and as long as they are there, the neural net is going to learn to use them.

Then pass the embedding size to the learner:



- emb_szs : embedding size
- len(df.columns)-len(cat_vars) : number of continuous variables in the data frame
- 0.04 : embedding matrix has its own dropout and this is the dropout rate
- 1 : how many outputs we want to create (output of the last linear layer)
- [1000, 500] : number of activations in the first linear layer, and the second linear layer
- [0.001, 0.01] : dropout in the first linear layer, and the second linear layer
- y_range : we will not worry about that for now

In [0]:
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
m.summary()#doesnt work

In [0]:
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1,
                   [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

In [0]:
m.fit(lr, 3, metrics=[exp_rmspe])
#metrics : this is a custom metric which specifies a function to be called 
#at the end of every epoch and prints out a result
#exp_rmspe of 0.098 after 3 epochs !!


epoch      trn_loss   val_loss   exp_rmspe  
    0      0.01328    0.000371   0.019455  
    1      0.011348   0.00363    0.0621    
    2      0.01002    0.008728   0.097925  



[array([0.00873]), 0.09792491196688566]
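The exp_rmspe metric is not defined in this section. The sketch below is one plausible reading of it, assuming the model is trained on the log of the target (as the name yl suggests), so the metric first undoes the log with exp and then computes the root mean squared percentage error. The function body and the sample numbers are illustrative assumptions, not the notebook's own definition.

```python
import math

def exp_rmspe(log_preds, log_targs):
    """Root mean squared percentage error, computed after undoing the log."""
    pct_errs = []
    for lp, lt in zip(log_preds, log_targs):
        pred, targ = math.exp(lp), math.exp(lt)
        pct_errs.append((targ - pred) / targ)
    return math.sqrt(sum(e * e for e in pct_errs) / len(pct_errs))

# Hypothetical sales of 100 and 200, predicted as 110 and 180 (10% off each)
log_targs = [math.log(100.0), math.log(200.0)]
log_preds = [math.log(110.0), math.log(180.0)]
score = exp_rmspe(log_preds, log_targs)
print(round(score, 4))
```

Because the error is a percentage of the true value, being off by 10 on a sale of 100 counts the same as being off by 20 on a sale of 200.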

In [0]:
m.fit(lr, 1, metrics=[exp_rmspe], cycle_len=1)


There is a difference in the way we call get_learner. For images we just called ConvLearner.pretrained and passed it the data:

learn = ConvLearner.pretrained(arch, data, ps=0., precompute=True)

For these kinds of models, and in fact for a lot of models, the model we build depends on the data. In this case, we need to know what embedding matrices we have. So here the data object creates the learner (upside down from what we have seen before):

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1,
                   [1000,500], [0.001,0.01], y_range=y_range)

In [0]:
"""Summary of steps (if you want to use this for your own dataset) [01:17:56]:

Step 1. List categorical variable names, and list continuous variable names, and put them in a Pandas data frame

Step 2. Create a list of which row indexes you want in your validation set

Step 3. Call this exact line of code:

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, 
         yl.astype(np.float32), cat_flds=cat_vars, bs=128, 
         test_df=df_test)
Step 4. Create a list of how big you want each embedding matrix to be

Step 5. Call get_learner — you can use these exact parameters to start with:

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1,
                   [1000,500], [0.001,0.01], y_range=y_range)
Step 6. Call m.fit"""


In [0]:
#OPTIONAL
from sklearn.ensemble import RandomForestRegressor
((val,trn), (y_val,y_trn)) = split_by_idx(val_idx, df.values, yl)

m = RandomForestRegressor(n_estimators=40, max_features=0.99, min_samples_leaf=2,
                          n_jobs=-1, oob_score=True)
m.fit(trn, y_trn);
preds = m.predict(val)
m.score(trn, y_trn), m.score(val, y_val), m.oob_score_, exp_rmspe(preds, y_val)

# NATURAL LANGUAGE PROCESSING

One of the things you find in NLP is there are particular problems you can solve and they have particular names. There is a particular kind of problem in NLP called “language modeling” and it has a very specific definition — it means build a model where given a few words of a sentence, can you predict what the next word is going to be.

## Language Modelling

Here we have 18 months worth of papers from arXiv (arXiv.org) and this is an example:

In [0]:
#https://github.com/fastai/fastai/blob/master/courses/dl1/lang_model-arxiv.ipynb

' '.join(md.trn_ds[0].text[:150])

'<cat> csni <summ> the exploitation of mm - wave bands is one of the key - enabler for 5 g mobile \n radio networks . however , the introduction of mm - wave technologies in cellular \n networks is not straightforward due to harsh propagation conditions that limit \n the mm - wave access availability . mm - wave technologies require high - gain antenna \n systems to compensate for high path loss and limited power . as a consequence , \n directional transmissions must be used for cell discovery and synchronization \n processes : this can lead to a non - negligible access delay caused by the \n exploration of the cell area with multiple transmissions along different \n directions . \n    the integration of mm - wave technologies and conventional wireless access \n networks with the objective of speeding up the cell search process requires new \n'

<cat> — category of the paper. CSNI is Computer Science and Networking
<summ> — abstract of the paper
Here is what the output of a trained language model looks like. We ran simple tests in which you pass some priming text and see what the model thinks should come next:

sample_model(m, "<CAT> csni <SUMM> algorithms that")
...use the same network as a single node are not able to achieve the same performance as the traditional network - based routing algorithms . in this paper , we propose a novel routing scheme for routing protocols in wireless networks . the proposed scheme is based ...

It learned by reading arXiv papers that somebody who is writing about computer networking would talk like this. Remember, it started out not knowing English at all. It started out with an embedding matrix for every word in English that was random. By reading lots of arXiv papers, it learned what kind of words followed others.

Here we tried specifying a category to be computer vision:

sample_model(m, "<CAT> cscv <SUMM> algorithms that")
...use the same data to perform image classification are increasingly being used to improve the performance of image classification algorithms . in this paper , we propose a novel method for image classification using a deep convolutional neural network ( cnn ) . the proposed method is ...
It not only learned how to write English pretty well, but also after you say something like “convolutional neural network” you should then use parenthesis to specify an acronym “(CNN)”.

sample_model(m,"<CAT> cscv <SUMM> algorithms. <TITLE> on ")
...the performance of deep learning for image classification <eos>
sample_model(m,"<CAT> csni <SUMM> algorithms. <TITLE> on ")
...the performance of wireless networks <eos>
sample_model(m,"<CAT> cscv <SUMM> algorithms. <TITLE> towards ")
...a new approach to image classification <eos>
sample_model(m,"<CAT> csni <SUMM> algorithms. <TITLE> towards ")
...a new approach to the analysis of wireless networks <eos>
A language model can be incredibly deep and subtle, so we are going to try to build one. We do not care about the language model for its own sake; we want a pre-trained model that can be used for other tasks. For example, given an IMDB movie review, we will figure out whether it is positive or negative. It is a lot like cats vs. dogs: a classification problem. So we would really like to use a pre-trained network which at least knows how to read English. So we will train a model that predicts the next word of a sentence (i.e. a language model), and just like in computer vision, stick some new layers on the end and ask it to predict whether something is positive or negative.

# IMDB

What we are going to do is to train a language model, making that the pre-trained model for a classification model. In other words, we are trying to leverage exactly what we learned in our computer vision which is how to do fine-tuning to create powerful classification models.

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy

In [0]:
!wget http://files.fast.ai/data/aclImdb.tgz

--2018-10-14 09:49:12--  http://files.fast.ai/data/aclImdb.tgz
Resolving files.fast.ai (files.fast.ai)... 67.205.15.147
Connecting to files.fast.ai (files.fast.ai)|67.205.15.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 145982645 (139M) [text/plain]
Saving to: ‘aclImdb.tgz’


2018-10-14 09:49:16 (36.1 MB/s) - ‘aclImdb.tgz’ saved [145982645/145982645]



In [0]:
!tar xzvf aclImdb.tgz

In [0]:
!mkdir data
!mv aclImdb data

In [0]:
PATH='data/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

imdbEr.txt  imdb.vocab  README  test/  train/


In [0]:
trn_files = !ls {TRN}
#Let's look inside the training folder...

trn_files[:10]

['0_0.txt       1562_10.txt  24997_0.txt\t34371_0.txt  43748_0.txt  6248_7.txt',
 '0_3.txt       15621_0.txt  24998_0.txt\t3437_1.txt   43749_0.txt  6249_0.txt',
 '0_9.txt       1562_1.txt   24999_0.txt\t34372_0.txt  437_4.txt\t  6249_2.txt',
 '10000_0.txt   15622_0.txt  25000_0.txt\t34373_0.txt  43750_0.txt  6249_7.txt',
 '10000_4.txt   15623_0.txt  2500_0.txt\t34374_0.txt  4375_0.txt   624_9.txt',
 '10000_8.txt   15624_0.txt  25001_0.txt\t34375_0.txt  43751_0.txt  6250_0.txt',
 '1000_0.txt    15625_0.txt  2500_1.txt\t34376_0.txt  4375_1.txt   6250_10.txt',
 '10001_0.txt   15626_0.txt  25002_0.txt\t34377_0.txt  43752_0.txt  6250_1.txt',
 '10001_10.txt  15627_0.txt  25003_0.txt\t34378_0.txt  43753_0.txt  625_0.txt',
 '10001_4.txt   15628_0.txt  25004_0.txt\t3437_8.txt   43754_0.txt  625_10.txt']

In [0]:
#...and at an example review.
review = !cat {TRN}{trn_files[6]}


review[0]

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose...man, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out 

In [0]:
#Now we'll check how many words are in the dataset.

!find {TRN} -name '*.txt' | xargs cat | wc -w

!find {VAL} -name '*.txt' | xargs cat | wc -w


17486581
5686719


Before we can do anything with text, we have to turn it into a list of tokens. A token is basically like a word. Eventually we will turn them into a list of numbers, but the first step is to turn the text into a list of words; this is called “tokenization” in NLP. A good tokenizer will do a good job of recognizing pieces in your sentence. Each piece of punctuation will be separated, and each part of a multi-part word will be separated as appropriate. spaCy does a lot of NLP stuff, and it has the best tokenizer Jeremy knows. So the fastai library is designed to work well with the spaCy tokenizer, as it does with torchtext.
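To see why the choice of tokenizer matters, here is a deliberately naive tokenizer built from a single regular expression. It is not spaCy and not what fastai uses; `naive_tokenize` is a hypothetical helper for illustration only.

```python
import re

def naive_tokenize(text):
    """Lowercase, then split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = naive_tokenize("So, it wasn't bad!")
print(tokens)
```

This rule splits punctuation off correctly but breaks "wasn't" into three pieces (wasn, ', t), whereas the spaCy output shown later in this section keeps the more meaningful pieces "was" and "n't". Handling multi-part words like that is what a good tokenizer adds.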

In [0]:
! python -m spacy download en

Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 48.5MB/s 
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0

    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en

    You can now load the model via spacy.load('en')



In [0]:
spacy_tok = spacy.load('en')

In [0]:

' '.join([sent.string.strip() for sent in spacy_tok(review[0])])

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop - socky fung - ku , but what I got instead was a comedy . So , it was n't quite was I was expecting , but I really liked it anyway ! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them ! ! I was laughing my ass off . I mean , the cops were just so bad ! And when I say bad , I mean The Shield Vic Macky bad . But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose ... man , oh man . What can you say about that hottie . She was great and put those other actresses to shame . She should work more often ! ! ! ! ! I also really liked the fight scene outside of the building . That was done really well . Lots of fighting and people getting their heads banged up . FUN ! Last , but not least Joe Estevez and William Smith were great as the ... well , I was n't sure what they were , but they see

## Creating a field [01:41:01]
A field is a definition of how to pre-process some text.

In [0]:
TEXT = data.Field(lower=True, tokenize="spacy")


- lower=True : lowercase the text
- tokenize="spacy" : tokenize with the spaCy tokenizer

Now we create the usual Fast.ai model data object:

fastai works closely with torchtext. We create a ModelData object for language modeling by taking advantage of LanguageModelData, passing it our torchtext field object, and the paths to our training, test, and validation sets. In this case, we don't have a separate test set, so we'll just use VAL_PATH for that too.

As well as the usual bs (batch size) parameter, we also now have bptt; this defines how many words are processed at a time in each row of the mini-batch. More importantly, it defines how many 'layers' we will backprop through. Making this number higher will increase time and memory requirements, but will improve the model's ability to handle long sentences.

In [0]:
bs=64; bptt=70


- PATH : as per usual where the data is, where to save models, etc
- TEXT : torchtext’s Field definition
- \*\*FILES : dictionary of the folders we have: training, validation, and test (to keep things simple, we do not have separate validation and test sets, so both point to the validation folder)
- bs : batch size
- bptt : Back Prop Through Time. It means how long a sentence we will stick on the GPU at once
- min_freq=10 : In a moment, we are going to be replacing words with integers (a unique index for every word). If there are any words that occur fewer than 10 times, just call them unknown.


After building our ModelData object, it automatically fills the TEXT object with a very important attribute: TEXT.vocab. This is a vocabulary, which stores which unique words (or tokens) have been seen in the text, and how each word will be mapped to a unique integer id.
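The idea behind such a vocabulary can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers, the two special tokens come first, the remaining tokens are ordered by frequency (ties broken alphabetically, which is an arbitrary choice made here), and anything rarer than min_freq maps to the unknown token.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): a list of tokens and the reverse token-to-id mapping."""
    counts = Counter(tokens)
    kept = [(tok, c) for tok, c in counts.items() if c >= min_freq]
    kept.sort(key=lambda tc: (-tc[1], tc[0]))  # most frequent first
    itos = list(specials) + [tok for tok, _ in kept]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace each token with its id; unseen or rare tokens become <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

tokens = "the cat sat on the mat . the dog sat .".split()
itos, stoi = build_vocab(tokens, min_freq=2)
ids = numericalize(["the", "dog", "sat", "."], stoi)
print(itos)
print(ids)
```

Here "dog" appears only once, so with min_freq=2 it is not in the vocabulary and is mapped to the id of the unknown token.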

In [0]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

In [0]:
#pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)
#Here are the: # batches; # unique tokens in the vocab; length of the dataset; # of words

(4583, 37392, 1, 20540756)

itos is sorted by frequency except for the first two special ones. Using vocab, torchtext will turn words into integer IDs for us :

In [0]:
# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'in', 'it']

In [0]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

2

In [0]:
md.trn_ds[0].text[:12]
#Note that in a LanguageModelData object there is only one item 
#in each dataset: all the words of the text joined together.

['depressing',
 'and',
 'meaningless',
 'pap',
 '.',
 'it',
 'is',
 'like',
 'one',
 'of',
 'those',
 'french']

In [0]:
#torchtext will handle turning these words into integer IDs for us automatically.
TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
  2086
     5
  3913
 16845
     4
    11
     9
    47
    37
     7
   164
   723
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

Our LanguageModelData object will create batches with 64 columns (that's our batch size), and varying sequence lengths of around 70 tokens (that's our bptt parameter - backprop through time).

Each batch also contains the exact same data as labels, but one word later in the text - since we're trying to always predict the next word. The labels are flattened into a 1d array.

## Batch size and BPTT [01:47:40]
What happens in a language model is even though we have lots of movie reviews, they all get concatenated together into one big block of text. So we predict the next word in this huge long thing which is all of the IMDB movie reviews concatenated together.
![alt text](https://cdn-images-1.medium.com/max/1200/1*O-Kq1qtgZmrShbKhaN3fTg.png)

- We split up the concatenated reviews into batches. In this case, we will split it to 64 sections
- We then move each section underneath the previous one, and transpose it.
- We end up with a matrix which is 1 million by 64.
- We then grab a little chunk at a time, and those chunk lengths are approximately equal to BPTT. Here, we grab a little 70 long section and that is the first thing we chuck into our GPU (i.e. the batch).

- We grab our first training batch by wrapping data loader with iter then calling next.
- We got back a 72 by 64 tensor in the run shown here (approximately 70 rows but not exactly)
- A neat trick torchtext does is to randomly change the bptt number every time so each epoch it is getting slightly different bits of text, similar to shuffling images in computer vision. We cannot randomly shuffle the words because they need to be in the right order, so instead, we randomly move their breakpoints a little bit.
- The target value is also 72 by 64 but for minor technical reasons it is flattened out into a single vector (72 × 64 = 4608 elements).
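The batching steps above can be sketched in plain Python. This is a simplified illustration, not the fastai or torchtext code: `batchify` and `get_batch` are hypothetical helper names, the tokens are stand-in integers, and the random variation of bptt is left out.

```python
def batchify(tokens, bs):
    """Split one long token stream into bs sections and lay them side by side as columns."""
    n = len(tokens) // bs  # rows per column; leftover tokens are dropped
    cols = [tokens[i * n:(i + 1) * n] for i in range(bs)]
    # Row r holds the r-th token of every section
    return [[col[r] for col in cols] for r in range(n)]

def get_batch(matrix, start, bptt):
    """Return (x, y): bptt rows of input, and the same rows shifted by one, flattened."""
    seq_len = min(bptt, len(matrix) - 1 - start)
    x = matrix[start:start + seq_len]
    y_rows = matrix[start + 1:start + 1 + seq_len]
    y = [tok for row in y_rows for tok in row]  # flattened to 1d
    return x, y

stream = list(range(24))        # stand-in for the concatenated reviews
matrix = batchify(stream, bs=4)  # 6 rows by 4 columns
x, y = get_batch(matrix, start=0, bptt=3)
print(x)
print(y)
```

Each column is a contiguous stretch of the original text, so reading down a column follows the text in order, and every target is simply the token that comes next in the same column.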

In [0]:
next(iter(md.trn_dl))


(Variable containing:
   2086    148     44  ...     321      3      3
      5    502  10927  ...       7    113      5
   3913    158   2863  ...       2     11      2
         ...            ⋱           ...         
     65     27  33329  ...     186      3   2370
    860      6     17  ...     596  10687    905
      4    314     41  ...      20      3    931
 [torch.cuda.LongTensor of size 72x64 (GPU 0)], Variable containing:
      5
    502
  10927
   ⋮   
      2
      5
  14962
 [torch.cuda.LongTensor of size 4608 (GPU 0)])

# Create a model


This is what our embedding matrix looks like:

![alt text](https://cdn-images-1.medium.com/max/1200/1*6EHxqeSYMioiLEQ5ufrf_g.png)

- It is a high cardinality categorical variable and furthermore, it is the only variable — this is typical in NLP
- The embedding size is 200 which is much bigger than our previous embedding vectors. Not surprising because a word has a lot more nuance to it than the concept of Sunday. Generally, an embedding size for a word will be somewhere between 50 and 600.

In [0]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

Researchers have found that large amounts of momentum (which we’ll learn about later) don’t work well with these kinds of RNN models, so we create a version of the Adam optimizer with less momentum than its default of 0.9. Any time you are doing NLP, you should probably include this line:

In [0]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

Fast.ai uses a variant of the state of the art AWD LSTM Language Model developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through Dropout. There is no simple way known (yet!) to find the best values of the dropout parameters below — you just have to experiment…

However, the other parameters (alpha, beta, and clip) shouldn't generally need tuning.

In [0]:
learner = md.get_model(opt_fn, em_sz, nh, nl, dropouti=0.05,
                       dropout=0.05, wdrop=0.1, dropoute=0.02, 
                       dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

- There is another kind of way we can avoid overfitting that we will talk about in the last class. For now, learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1) works reliably so all of your NLP models probably want this particular line.
- learner.clip=0.3 : when you look at your gradients and multiply them by the learning rate to decide how much to update your weights by, this will not allow them to be more than 0.3. This is a cool little trick to prevent us from taking too big a step.
- Details do not matter too much right now, so you can use them as they are.
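Gradient clipping can be illustrated with a small sketch. It assumes the clip value bounds the overall norm of the gradient, in the spirit of PyTorch's clip_grad_norm_; how fastai applies learner.clip internally is not shown in this section, and `clip_by_norm` is a hypothetical helper.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient so its L2 norm is at most max_norm; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=0.3)    # norm 5.0, gets scaled down
untouched = clip_by_norm([0.1, 0.2], max_norm=0.3)  # norm below 0.3, left as is
print(clipped)
print(untouched)
```

Small gradients pass through unchanged; only unusually large ones are shrunk, which is what stops a single bad batch from causing a huge step.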

In [0]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)



 40%|███▉      | 1833/4583 [08:29<13:04,  3.50it/s, loss=5.19]

KeyboardInterrupt: ignored

In [0]:
learner.save_encoder('adam3_20_enc')
learner.load_encoder('adam3_20_enc')

Language modeling accuracy is generally measured using the metric perplexity, which is simply exp() of the loss function we used.

In [0]:
math.exp(4.165)
#pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))
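To make the link between loss and perplexity concrete, here is a small sketch that assumes the loss is the average negative log probability the model assigns to each correct next word (cross-entropy). The `perplexity` helper and the probabilities are made up for illustration.

```python
import math

def perplexity(probs_of_correct_word):
    """exp of the mean negative log-likelihood of the correct next words."""
    loss = -sum(math.log(p) for p in probs_of_correct_word) / len(probs_of_correct_word)
    return math.exp(loss)

# A model that always spreads its bets evenly over 4 words
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is usually confident in the right word
confident = perplexity([0.9, 0.8, 0.95])
print(uniform, confident)
```

A perplexity of k can be read as the model being about as uncertain as if it were choosing uniformly among k words at each step, so lower is better.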


## Testing [02:04:53]
We can play around with our language model a bit to check it seems to be working OK. First, let’s create a short bit of text to ‘prime’ a set of predictions. We’ll use our torchtext field to numericalize it so we can feed it to our language model.

In [0]:
m=learner.model
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
s = [TEXT.preprocess(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])

In [0]:
#We haven't yet added methods to make it easy to test a language model, 
#so we'll need to manually go through the steps.
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs


In [0]:
#Let's see what the top 10 predictions were for the next word after our short text:

nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

In [0]:
#...and let's see if our model can generate a bit more text all by itself!

print(ss,"\n")
for i in range(50):
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = m(n[0].unsqueeze(0))
print('...')

## Sentiment [02:05:09]
So we had pre-trained a language model and now we want to fine-tune it to do sentiment classification.

To use a pre-trained model, we will need the saved vocab from the language model, since we need to ensure the same words map to the same IDs.

In [0]:
#TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))


sequential=False tells torchtext that a text field should not be tokenized (in this case, we just want to store the 'positive' or 'negative' single label).

This time, we need to not treat the whole thing as one big piece of text but every review is separate because each one has a different sentiment attached to it.

splits is a torchtext method that creates train, test, and validation sets. The IMDB dataset is built into torchtext, so we can take advantage of that. Take a look at lang_model-arxiv.ipynb to see how to define your own fastai/torchtext datasets.

In [0]:

IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')
t = splits[0].examples[0]
t.label, ' '.join(t.text[:16])


fastai can create a ModelData object directly from torchtext splits.

In [0]:
md2 = TextData.from_splits(PATH, splits, bs)
#Now you can go ahead and call get_model that gets us our learner. 
#Then we can load into it the pre-trained language model (load_encoder).
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder('adam3_20_enc')  # the encoder saved after training the language model

Because we’re fine-tuning a pretrained model, we’ll use differential learning rates, and also increase the max gradient for clipping, to allow the SGDR to work better.

In [0]:
m3.clip=25.
lrs=np.array([1e-4,1e-3,1e-2])
m3.freeze_to(-1)
m3.fit(lrs/2, 1, metrics=[accuracy])
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)

We make sure all except the last layer is frozen. Then we train a bit, unfreeze it, train it a bit. The nice thing is once you have got a pre-trained language model, it actually trains really fast.

In [0]:
m3.fit(lrs, 7, metrics=[accuracy], cycle_len=2, cycle_save_name='imdb2')
m3.load_cycle('imdb2', 4)
accuracy_np(*m3.predict_with_targs())


A recent paper from McCann, Bradbury et al., Learned in Translation: Contextualized Word Vectors, has a handy summary of the latest academic research in solving this IMDB sentiment analysis problem. Many of the latest algorithms shown are tuned for this specific problem.

![alt text](https://cdn-images-1.medium.com/max/1200/1*PotEPJjvS-R4C5OCMbw7Vw.png)
