# Notebook 2 - Text Preprocessing & Vectorization
The second step in our process was exploring methods for preprocessing and vectorizing the text in the DESCRIPTION_GOOD column.  Preprocessing removes unnecessary punctuation and stop words from each body of text, because punctuation and stop words (like 'the' or 'and') can distort the results of vectorization.  Vectorization converts each entry in the DESCRIPTION_GOOD column into a vector whose values represent how well a given word 'describes' that data point.  For example, if a trade involving 'apples' is found in the dataset, the vectorizer will most likely assign a high value to the word 'apple' in that trade's vector.  This is because the word 'apple' can probably be used to distinguish this trade from the rest of the trades in the dataset.
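As a rough illustration of the 'apple' example, the sketch below fits scikit-learn's `TfidfVectorizer` on three made-up English descriptions. The toy corpus and all variable names are assumptions for illustration only, not project data. A word found in a single description should receive a larger weight than a word shared by every description.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): every description mentions "shipment",
# but only the first one mentions "apples".
toy_docs = [
    "shipment of apples",
    "shipment of steel",
    "shipment of timber",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Look up the column of each word, then read its weight in the first document.
apples_col = toy_vectorizer.vocabulary_["apples"]
shipment_col = toy_vectorizer.vocabulary_["shipment"]
apples_weight = toy_matrix[0, apples_col]
shipment_weight = toy_matrix[0, shipment_col]

print("apples:", round(float(apples_weight), 3))
print("shipment:", round(float(shipment_weight), 3))
```

In this toy corpus 'apples' identifies the first description, while 'shipment' says nothing about which description it came from, so the rarer word ends up with the higher weight.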

In [None]:
# pip installations - These are necessary for the notebook to run.  Labs20 group would like to fix this by creating a container for the SageMaker instance.
!pip install --upgrade dask
!pip install fsspec
!pip install --upgrade s3fs
!pip install numpy
!pip install pymystem3
!pip install joblib
!pip install pymorphy2==0.8
!pip install dask-ml

In [1]:
# IMPORTS

# dataframe
import dask.dataframe as dd
import pandas as pd

# train/test split
from sklearn.model_selection import train_test_split as tts

# DESCRIPTION_GOOD preprocessing
import nltk
nltk.download("stopwords")
#--------#
from nltk.corpus import stopwords
from pymystem3 import Mystem
from string import punctuation

# text vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# KMeans clustering
import dask_ml.cluster as dask_ml_model

# saving vectorizer/model to S3 bucket
import tempfile
import boto3
import joblib

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




In [2]:
# Import df_trade_data_desc.csv from S3 bucket. Following Labs18 steps, the data must be read as a Dask dataframe and then loaded into memory,
# where it can be manipulated as a pandas dataframe.
df = dd.read_csv('s3://labs20-arms-bucket/data/df_trade_data_desc.csv/*.csv')

In [3]:
# Executing .compute() on a dask dataframe reads it to memory, allowing it to be manipulated as a pandas dataframe.
df = df.compute()

In [4]:
df.to_csv('s3://labs20-arms-bucket/data/full_df.csv')

### Cleaning Invalid Tax ID Numbers
Labs20 ARMS group wanted to save as much data as possible.  This dataset contains invalid values in the `CONSIGNOR_INN` column, including `None`, `0`, `00`, `ИНН/КПП НЕ О`, as well as null values.  Dropping all rows with invalid/NaN Tax ID numbers loses 538,605 rows of data.

We wanted to improve this number.  To do this, we created a clean dictionary of every valid Company Name:Company INN pair in the dataset, and then mapped it to the CONSIGNOR_INN column of the main dataframe.  The results were very positive:
- Initial Dataframe Shape = 10,750,631 rows
- Initial Dataframe Shape After Dropping Unusable Values = 10,212,026 rows
- Post Dictionary-Cleaning Dataframe Shape = 10,390,008 rows

After cleaning and dropping NaNs, we lost only 360,623 rows of data.  We were able to save 177,982 rows of data (10,390,008 - 10,212,026) by using the cleaning dictionary instead of dropping every row with an invalid/null INN.
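The dictionary-cleaning idea can be sketched on a five-row toy frame. The company names and INN values below are invented for illustration; only the column names and the list of invalid values come from this notebook.

```python
import pandas as pd

# Toy data (hypothetical): companies A and B each have one valid INN on record,
# company C never appears with a valid INN.
toy = pd.DataFrame({
    "CONSIGNOR_NAME": ["A", "A", "B", "B", "C"],
    "CONSIGNOR_INN": ["111", "None", "222", None, "0"],
})
invalid_values = ["None", "00", "ИНН/КПП НЕ О", "0"]

# Option 1: simply drop every row with an invalid or missing INN.
valid = toy[~toy["CONSIGNOR_INN"].isin(invalid_values)].dropna()

# Option 2: build a name -> INN lookup from the valid rows,
# then map it back onto every row before dropping what is still missing.
name_to_inn = (
    valid.drop_duplicates(subset="CONSIGNOR_NAME")
         .set_index("CONSIGNOR_NAME")["CONSIGNOR_INN"]
         .to_dict()
)
toy["CONSIGNOR_INN"] = toy["CONSIGNOR_NAME"].map(name_to_inn)
cleaned = toy.dropna()

print("rows kept by dropping:", len(valid))
print("rows kept by mapping:", len(cleaned))
```

Rows for a company with at least one valid INN elsewhere in the data are recovered; rows for a company that never has a valid INN are still dropped.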

In [5]:
# check size of dataframe
df.shape

(10750631, 4)

In [8]:
# remove all rows with invalid/NaN INNs in 'CONSIGNOR_INN' column from dataframe
invalid_values = ['None', '00', 'ИНН/КПП НЕ О', '0']
for i in invalid_values:
    df = df[df['CONSIGNOR_INN'] != i]
df.dropna(inplace=True)

In [7]:
# check resulting shape
df.shape

(10212026, 4)

In [5]:
# RELOAD DATAFRAME
# df = dd.read_csv('s3://labs20-arms-bucket/data/df_trade_data_desc.csv/*.csv')
# df = df.compute()

# Create subslice of dataframe for dictionary
dict_df = df[['CONSIGNOR_NAME', 'CONSIGNOR_INN']]

# clean columns of dict_df, remove all invalid values from CONSIGNOR_INN column
invalid_values = ['None', '00', 'ИНН/КПП НЕ О', '0']
for i in invalid_values:
    dict_df = dict_df[dict_df['CONSIGNOR_INN'] != i]

# drop all null values
dict_df.dropna(inplace=True)
# sort values by 'CONSIGNOR_NAME'
dict_df.sort_values("CONSIGNOR_NAME", inplace=True)
# drop all duplicate 'CONSIGNOR_NAME' values from dictionary, keeping the first occurrence of each name
dict_df.drop_duplicates(subset="CONSIGNOR_NAME", keep='first', inplace=True)

# create list of 2-item lists: [CONSIGNOR_NAME, CONSIGNOR_INN]
new_list = dict_df.values.tolist()
# build a dictionary from the list of lists:
# the first item of each pair (CONSIGNOR_NAME) becomes the key and the second item (CONSIGNOR_INN) becomes the value
# DataFrame.to_dict() is not used here because its default output is keyed by column name; we only want name:INN pairs
new_dict = {t[0]:t[1] for t in new_list}

In [6]:
# map new_dict to 'CONSIGNOR_INN' column of main dataframe
df['CONSIGNOR_INN'] = df['CONSIGNOR_NAME'].map(new_dict)

In [7]:
# drop null values from dataframe
# although our dictionary mapping created 323,365 NaN values in the CONSIGNOR_INN column, we cleaned up over 522,000 'None' value INNs
# coming up with 177,000 more rows as a result of this process
df.dropna(inplace=True)
df.shape

(10390008, 4)

In [8]:
df.head()

Unnamed: 0,CONSIGNOR_NAME,DECLARATION_NUMBER,CONSIGNOR_INN,DESCRIPTION_GOOD
0,ООО ЛЕСАН ФАРМА,10313140/140518/0012639,6161051021,ЛЕКАРСТВЕННЫЕ СРЕДСТВА РАСФАСОВАННЫЕ В УПАКОВК...
2,ООО ПКФ ТЕХНОМАРКЕТ,10404080/030817/0010907,1639035552,РАДИАТОР ВОДЯНОЙ 4Х РЯДНЫЙ ИЗ ЧЕРНЫХ МЕТАЛЛОВ ...
3,АО УРГАЛУГОЛЬ,10006090/190318/0001005,2710001186,УГОЛЬ КАМЕННЫЙ БИТУМИНОЗНЫЙ ПРОЧИЙ ПРЕДЕЛЬНЫЙ ...
4,ООО ЛЕСАН ФАРМА,10313010/110816/0012167,6161051021,МАРЛЯ И ИЗДЕЛИЯ ИЗ МАРЛИ РАСФАСОВАНЫ ДЛЯ РОЗНИ...
5,ООО НАША ПОЧТА,10313140/130317/0002199,9102182010,ОДЕЖДА И ЕЕ ПРИНАДЛЕЖНОСТИ ИЗ ПЛАСТМАСС БАХИЛ...


### Create lemmatizer, stopwords list, and punctuation remover.  

Simply put, mystem.lemmatize() reduces each word in the DESCRIPTION_GOOD column to its dictionary form and returns the result as a list of 'tokens', which
are fed to the vectorizer for analysis.  Removing stopwords prevents the vectorizer from assigning high values to generic words that are most likely found
in all entries.  For example, a vectorizer that thinks the Russian equivalent of 'the' is a good way to define a body of text isn't very effective.  Likewise,
a vectorizer that thinks ',' or '?' is a good way to define a body of text isn't very effective.

We ran the vectorizer a couple of times without adding stopwords to the filter.  Resulting columns contained a lot of numbers and generic Russian terms, such as `1`, `01`, `использование`, & `масса`.  Based on those results, we expanded the list of stop words to include those words.

For future groups, adding generic Russian trade words to the stopwords list would likely make the model more effective.  Our model found that the Russian equivalents
of some trade- and measurement-specific terms, like 'meter', 'gram', or 'pallet', were used to define the text of many of our trades.  Although these are not stop words, removing such generic trade/measurement terms from the text corpus before vectorization will likely improve the results of the model.

In [None]:
#create stem and stopwords list
mystem = Mystem() 
russian_stopwords = stopwords.words("russian")

# https://stackoverflow.com/questions/5511708/adding-words-to-nltk-stoplist
# add trade-specific stopwords to list
newStopWords = ['г', '№', '10', '1', '20', '30', 'кг', '5', 'см',
                '100', '80', '2', 'х', 'l', 'м', '00', '000',
                '1.27', '2011.10631', '4', '12', '3', 'фр', 'количество',
                'становиться', 'мм', 'вид', 'упаковка', 'получать',
                'прочий', 'использование', 'масса', 'размер', 'черный',
                '6', '8', '7', '50', '40', '25', 'коробка', 'поддон',
                'вдоль', '250', '65', '85', '15', '35', '40', '45',
                '55', '60', '70', '75', 'м3', '13', '0', '14',
                '16', '18', 'm2', 'п', 'р', 'т', 'тип', 'являться',
                'размер', 'cm', 'm', '01', '02', '03', '04', '05',
                '06', '07', '08', '09', '24', '27']
russian_stopwords.extend(newStopWords)

In [10]:
#define function for preprocessing text - to be used later in notebook
#function will remove Russian stop words and any punctuation not removed in cleaning_trade_data_desc_kmeans.ipynb
def preprocess_text(text):
    tokens = mystem.lemmatize(text.lower())
    tokens = [token for token in tokens if token not in russian_stopwords\
              and token != " " \
              and token.strip() not in punctuation]
    text = " ".join(tokens)
    return text
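The filtering half of `preprocess_text` can be tried without installing Mystem. The sketch below is a simplified variant, not the notebook's function: it skips lemmatization, takes an already-tokenized list, uses a small hypothetical stop word set, and treats a token as punctuation only when every character in it is punctuation. Storing the stop words in a `set` keeps each membership check fast, which may matter when the check runs millions of times.

```python
from string import punctuation

# Hypothetical stop word set for illustration; the notebook uses the NLTK
# Russian list plus the trade-specific additions.
toy_stopwords = {"и", "в", "для", "кг", "шт"}

def filter_tokens(tokens, stopwords=toy_stopwords):
    """Drop whitespace-only tokens, stop words, and pure-punctuation tokens."""
    kept = []
    for token in tokens:
        stripped = token.strip()
        if not stripped:
            continue  # whitespace-only token
        if stripped in stopwords:
            continue  # generic word that should not reach the vectorizer
        if all(ch in punctuation for ch in stripped):
            continue  # token made only of punctuation characters
        kept.append(stripped)
    return " ".join(kept)

# Tokens shaped like lemmatizer output, which includes separators.
sample_tokens = ["уголь", " ", "каменный", ", ", "для", " ", "шт", "\n"]
filtered = filter_tokens(sample_tokens)
print(filtered)
```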

In [11]:
# split data into training/test sets
train, test = tts(df, test_size=0.2)
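The split above is random, so each run of the notebook produces different train and test sets. If a reproducible split is wanted, one option is to pass a fixed `random_state`. The sketch below uses a ten-row toy frame, and the seed value 42 is an arbitrary assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical data) standing in for the trade dataframe.
toy_df = pd.DataFrame({"DESCRIPTION_GOOD": [f"item {i}" for i in range(10)]})

# The same seed gives the same shuffle, so both calls return identical splits.
train_a, test_a = train_test_split(toy_df, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy_df, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))
```

Note that the returned frames keep the shuffled row labels of the original dataframe, which matters for any later index-based merge.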

In [12]:
# saves training dataframe to S3
train.to_csv('s3://labs20-arms-bucket/data/df_trade_desc_unprocessed_trainIF2.csv')

In [13]:
# saves test dataframe to S3
# this held-out test set will be used to test the effectiveness of our final product.
test.to_csv('s3://labs20-arms-bucket/data/df_trade_desc_unprocessed_testIF2.csv')

In [15]:
#create list for preprocessed text to be appended to
processed_text_list = []

In [16]:
# This for loop applies the preprocess_text function to every entry in the 'DESCRIPTION_GOOD' column
# appends result to list 'processed_text_list' (above)
for i in range(len(train['DESCRIPTION_GOOD'])):
    x = train['DESCRIPTION_GOOD'].iloc[i]
    if isinstance(x, str):
        processed_text_list.append(preprocess_text(x))
    else:
        # str() works for any value type; plain Python floats and ints have no .astype() method
        processed_text_list.append(preprocess_text(str(x)))
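A possible alternative to the explicit loop is to cast the column to strings and apply the function with `Series.map`. This is a sketch under assumptions: `preprocess_stub` is a hypothetical stand-in for `preprocess_text` (it only lowercases), and the toy frame is invented. Because the result is assigned straight back to the dataframe, the new column keeps the same row labels as the source column.

```python
import pandas as pd

def preprocess_stub(text):
    """Hypothetical stand-in for preprocess_text; it only lowercases."""
    return text.lower()

# Toy frame with shuffled row labels and one non-string value.
train_toy = pd.DataFrame(
    {"DESCRIPTION_GOOD": ["УГОЛЬ КАМЕННЫЙ", "МАРЛЯ", 123]},
    index=[7, 2, 5],
)

# astype(str) handles non-string entries; map applies the function row by row.
train_toy["PROCESSED_TEXT"] = (
    train_toy["DESCRIPTION_GOOD"].astype(str).map(preprocess_stub)
)

print(train_toy)
```

This does not make the lemmatizer itself any faster; it mainly removes the separate list, the second dataframe, and the later merge.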

In [17]:
# convert list of preprocessed text to dataframe
# to be concatenated onto original dataframe as 'PROCESSED_TEXT' column
df1 = pd.DataFrame({'PROCESSED_TEXT':processed_text_list})

In [18]:
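# reset indices of both dataframes for merge
# train still carries the shuffled row labels from the train/test split, while df1 has a fresh 0..n-1 index;
# pd.concat(axis=1) aligns rows by index label, so both frames need the same positional index before merging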
df1 = df1.reset_index()
train = train.reset_index()
train['index'] = train.index
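The effect of the index reset can be seen on a three-row toy example (the frames below are invented for illustration). `pd.concat` with `axis=1` matches rows by index label, not by position, so a frame with shuffled labels only lines up with a freshly built frame after its index is reset.

```python
import pandas as pd

# 'left' mimics a shuffled training split; 'right' mimics a frame built from a list.
left = pd.DataFrame({"name": ["a", "b", "c"]}, index=[10, 0, 2])
right = pd.DataFrame({"text": ["x", "y", "z"]})

# Without a reset, only labels present in both frames survive the inner join,
# and the surviving rows are paired by label rather than by position.
misaligned = pd.concat([left, right], axis=1, join="inner")

# After a reset, both frames share the labels 0..2 and rows pair up by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1, join="inner")

print(misaligned)
print(aligned)
```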

In [19]:
# check df1
df1.head()

Unnamed: 0,index,PROCESSED_TEXT
0,0,пиломатериалыдоска еловый picea abies обрезная...
1,1,сульфат калий калий серонокислый технический к...
2,2,швеллер бампер ваз 21900280301501 шт
3,3,лесоматериал праспиливать вдольнестроганыенелу...
4,4,телефонный проводной трубка сбор связь бортпро...


In [20]:
# compare df1.head() against df.head()
train.head()

Unnamed: 0,index,CONSIGNOR_NAME,DECLARATION_NUMBER,CONSIGNOR_INN,DESCRIPTION_GOOD
0,0,АОЧЕРЕПОВЕЦКИЙ ФАНЕРНО-МЕБЕЛЬНЫЙ КОМБИНАТ,10210310/161116/0016793,3528006408,ПИЛОМАТЕРИАЛЫДОСКА ЕЛОВАЯ(PICEA ABIES) ОБРЕЗНА...
1,1,АО ПИКАЛЁВСКАЯ СОДА,10210200/080218/0004332,4715022874,СУЛЬФАТ КАЛИЯ (КАЛИЙ СЕРОНОКИСЛЫЙ) ДЛЯ ТЕХНИЧЕ...
2,2,ООО ИНТЕР-ТРАНС,10412110/140817/0008251,6324057625,ШВЕЛЛЕР БАМПЕРА ДЛЯ А/М ВАЗ 21900280301501 6 ШТ.
3,3,ООО ЗЕЛЕНЫЙ СВЕТ,10607040/310718/0012337,3849029492,ЛЕСОМАТЕРИАЛЫ Х/ПРАСПИЛЕННЫЕ ВДОЛЬНЕСТРОГАНЫЕН...
4,4,ОАО АВИАКОМПАНИЯ УРАЛЬСКИЕ АВИАЛИНИИ,10508010/010217/0001637,6608003013,ТЕЛЕФОННАЯ ПРОВОДНАЯ ТРУБКА В СБОРЕ ДЛЯ СВЯЗИ ...


In [21]:
# check df1
df1.iloc[775392]

index                                                        775392
PROCESSED_TEXT    часть кузов транспортный средство употребление...
Name: 775392, dtype: object

In [22]:
# compare random records of df against same record in df1
train.iloc[775392]

index                                                            775392
CONSIGNOR_NAME                                         ООО ТРАНС ИМПОРТ
DECLARATION_NUMBER                              10225030/050716/0002277
CONSIGNOR_INN                                                6025040768
DESCRIPTION_GOOD      ЧАСТИ КУЗОВОВ ДЛЯ ТРАНСПОРТНЫХ СРЕДСТВ БЫВШИЕ ...
Name: 775392, dtype: object

In [23]:
# check shapes of both dataframes
print("main dataframe shape:", train.shape, '\n',"processed text dataframe shape:", df1.shape)

main dataframe shape: (8312006, 5) 
 processed text dataframe shape: (8312006, 2)


In [24]:
# merge preprocessed text to original dataframe
df_merge = pd.concat([train, df1], axis=1, join='inner')

In [25]:
# confirm correct shape of merged dataframe
df_merge.shape

(8312006, 7)

In [26]:
# check data in merged dataframe
df_merge.head()

Unnamed: 0,index,CONSIGNOR_NAME,DECLARATION_NUMBER,CONSIGNOR_INN,DESCRIPTION_GOOD,index.1,PROCESSED_TEXT
0,0,АОЧЕРЕПОВЕЦКИЙ ФАНЕРНО-МЕБЕЛЬНЫЙ КОМБИНАТ,10210310/161116/0016793,3528006408,ПИЛОМАТЕРИАЛЫДОСКА ЕЛОВАЯ(PICEA ABIES) ОБРЕЗНА...,0,пиломатериалыдоска еловый picea abies обрезная...
1,1,АО ПИКАЛЁВСКАЯ СОДА,10210200/080218/0004332,4715022874,СУЛЬФАТ КАЛИЯ (КАЛИЙ СЕРОНОКИСЛЫЙ) ДЛЯ ТЕХНИЧЕ...,1,сульфат калий калий серонокислый технический к...
2,2,ООО ИНТЕР-ТРАНС,10412110/140817/0008251,6324057625,ШВЕЛЛЕР БАМПЕРА ДЛЯ А/М ВАЗ 21900280301501 6 ШТ.,2,швеллер бампер ваз 21900280301501 шт
3,3,ООО ЗЕЛЕНЫЙ СВЕТ,10607040/310718/0012337,3849029492,ЛЕСОМАТЕРИАЛЫ Х/ПРАСПИЛЕННЫЕ ВДОЛЬНЕСТРОГАНЫЕН...,3,лесоматериал праспиливать вдольнестроганыенелу...
4,4,ОАО АВИАКОМПАНИЯ УРАЛЬСКИЕ АВИАЛИНИИ,10508010/010217/0001637,6608003013,ТЕЛЕФОННАЯ ПРОВОДНАЯ ТРУБКА В СБОРЕ ДЛЯ СВЯЗИ ...,4,телефонный проводной трубка сбор связь бортпро...


In [27]:
# drop DESCRIPTION_GOOD column, no longer necessary now that PROCESSED_TEXT column is present
# no longer need declaration number at this point in analysis
df_merge = df_merge.drop(columns=['DECLARATION_NUMBER', 'DESCRIPTION_GOOD', 'index'])

In [28]:
# check results
df_merge.head()

Unnamed: 0,CONSIGNOR_NAME,CONSIGNOR_INN,PROCESSED_TEXT
0,АОЧЕРЕПОВЕЦКИЙ ФАНЕРНО-МЕБЕЛЬНЫЙ КОМБИНАТ,3528006408,пиломатериалыдоска еловый picea abies обрезная...
1,АО ПИКАЛЁВСКАЯ СОДА,4715022874,сульфат калий калий серонокислый технический к...
2,ООО ИНТЕР-ТРАНС,6324057625,швеллер бампер ваз 21900280301501 шт
3,ООО ЗЕЛЕНЫЙ СВЕТ,3849029492,лесоматериал праспиливать вдольнестроганыенелу...
4,ОАО АВИАКОМПАНИЯ УРАЛЬСКИЕ АВИАЛИНИИ,6608003013,телефонный проводной трубка сбор связь бортпро...


In [29]:
# Here the 'text' variable is defined.  This step is necessary to facilitate feeding to the TFIDF Vectorizer, as TfidfVectorizers are trained on arrays.
# For the purposes of this project, the 'PROCESSED_TEXT' column of the train dataset will be used for training.
text = df_merge['PROCESSED_TEXT']

### Define Vectorizer and Set Parameters
The vectorizer converts each item in the 'text' series into a vector of TF-IDF weights.  These vectors will be fed to the KMeans
model in the next notebook.

The parameters for the vectorizer can be explained as follows:
- max_df = 0.90 means "ignore terms that appear in more than 90% of the documents".  We chose this threshold to ignore redundant terms.
- min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".  We chose this threshold so our final model wouldn't be too specific.
- ngram_range = (1,2) means "consider both individual words and pairs of adjacent words".  We felt this was an appropriate n-gram length for our analysis.

Before fitting the vectorizer on our data, we cast every entry in the 'text' array to a NumPy unicode string (`text.values.astype('U')`).  This was necessary for the vectorizer to work, because it raises an error on non-string values such as NaN.

The resulting vectorizer converted the entire text corpus to 301 columns of vectorized data.
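The three parameters can be seen at work on a four-document toy corpus. The English descriptions and the thresholds `min_df=0.5` and `max_df=0.9` are assumptions chosen so that the filtering is visible on such a small corpus; they are not the values used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "goods" appears in every document,
# "pine" and "steel" in half of them, everything else in one document only.
toy_docs = [
    "goods pine lumber",
    "goods pine boards",
    "goods steel sheet",
    "goods steel pipe",
]

toy_vectorizer = TfidfVectorizer(min_df=0.5, max_df=0.9, ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each surviving term to its column index.
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)
```

'goods' is dropped by `max_df` because it appears in all four documents, and terms such as 'lumber' are dropped by `min_df` because they appear in only one. The pairs 'goods pine' and 'goods steel' survive, since document frequency is counted for each n-gram separately.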

In [32]:
# define vectorizer
#vectorizer = TfidfVectorizer(tokenizer=tokenize, min_df=0.01, max_df=.90, ngram_range=(1,2))
vectorizer = TfidfVectorizer(min_df=0.01, max_df=.90, ngram_range=(1,2)) # running without tokenizer and spacy

In [33]:
#fit_transform text with vectorizer
# Cast to unicode strings first; otherwise the vectorizer runs into an error on np.nan values.
sparse = vectorizer.fit_transform(text.values.astype('U'))

### Issues With Vectorization
Vectorization is a computationally expensive process, which creates a tradeoff between accuracy and scalability.  The more accurate we want our vectorizer to be, the more data it has to consider, and more data means more computing effort.  The more streamlined we want our process to be, the more we have to limit our data, and limiting data means less accurate results.

We attempted to run the vectorizer with more expansive parameters, such as `TfidfVectorizer(min_df=0.0025, max_df=.95, ngram_range=(1,3))`.  Instead of 301 columns, this vectorizer would create a dataframe with 1200-1300 columns.  Such a dataset was far too large for our SageMaker instance to process by itself; we would have had to incorporate cluster computing and batch processing into our vectorization.  This would have been worthwhile, as a larger vocabulary captures the corpus text more accurately.  Unfortunately, we could not accomplish this in the timeframe of our project.
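One way batch processing might look is sketched below: once a vectorizer has been fitted, the text can be transformed slice by slice and the sparse pieces stacked. The helper `transform_in_batches` and the toy texts are hypothetical and were not part of this project. The sketch only batches the transform step; fitting still needs a pass over the whole corpus.

```python
from scipy import sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

def transform_in_batches(fitted_vectorizer, texts, batch_size):
    """Transform `texts` one slice at a time and stack the sparse results."""
    blocks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        blocks.append(fitted_vectorizer.transform(batch))
    return sp.vstack(blocks, format="csr")

# Toy texts (hypothetical) standing in for the PROCESSED_TEXT column.
toy_texts = [
    "сосна доска",
    "уголь каменный",
    "сосна бревно",
    "сталь лист",
    "уголь бурый",
]
toy_vectorizer = TfidfVectorizer().fit(toy_texts)

batched = transform_in_batches(toy_vectorizer, toy_texts, batch_size=2)
whole = toy_vectorizer.transform(toy_texts)

# Each row is weighted and normalized on its own, so batching should not change it.
max_difference = abs(batched - whole).max()
print(batched.shape, max_difference)
```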

In [34]:
sparse

<8312006x301 sparse matrix of type '<class 'numpy.float64'>'
	with 56072965 stored elements in Compressed Sparse Row format>

In [36]:
# We then get the vectorizer's feature names for use as our dataframe's column headers
# (scikit-learn 1.0 added get_feature_names_out(), and get_feature_names() was removed in 1.2)
dtm = pd.DataFrame(sparse.todense(), columns=vectorizer.get_feature_names())
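Converting the sparse matrix to a dense one is the most memory-hungry step in this notebook. The back-of-the-envelope calculation below uses the shape and stored-element count printed for `sparse` above. The 4-byte index size is an assumption about how SciPy stored this particular matrix.

```python
# Figures taken from the printed sparse matrix summary.
n_rows, n_cols = 8_312_006, 301
stored_elements = 56_072_965

# Dense layout: one float64 (8 bytes) for every cell, zero or not.
dense_bytes = n_rows * n_cols * 8

# CSR layout: one float64 value and one column index per stored element,
# plus one row pointer per row (and one extra).
index_bytes = 4  # assumption: 32-bit indices
csr_bytes = (
    stored_elements * 8
    + stored_elements * index_bytes
    + (n_rows + 1) * index_bytes
)

density = stored_elements / (n_rows * n_cols)

print(f"dense:   {dense_bytes / 1e9:.1f} GB")
print(f"sparse:  {csr_bytes / 1e9:.1f} GB")
print(f"density: {density:.1%}")
```

Under these assumptions the dense copy needs roughly 20 GB against roughly 0.7 GB for the sparse form, because only about 2% of the cells are non-zero. Keeping the data sparse for as long as possible may be one way to fit a larger vocabulary on the same instance.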

In [37]:
# View Feature Matrix as DataFrame
dtm.head()

Unnamed: 0,00,10,11,27,848686,88104см,90,946288,946388,abies,...,черновой,швейный,шина,шип,шлифовать,шт,электрический,элемент,этиловый,этиловый спирт
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.433472,...,0.0,0.0,0.0,0.360739,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.6043,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.345262,0.0,0.0,0.0,0.0


In [38]:
# confirm correct shape of vectorized dataframe & training dataset
print("main dataframe shape:", df_merge.shape, '\n',"vectorized text dataframe shape:", dtm.shape)

main dataframe shape: (8312006, 3) 
 vectorized text dataframe shape: (8312006, 301)


In [39]:
# reset indices of both dataframes for merge
# pd.concat(axis=1) aligns rows by index label; both frames should already share a 0..n-1 index at this point,
# so this reset is probably redundant, but it guarantees the rows line up
dtm = dtm.reset_index()
df_merge = df_merge.reset_index()
df_merge['index'] = df_merge.index
dtm['index'] = dtm.index

In [40]:
# merge vectorized word feature matrix with training dataset
df_merge_vector = pd.concat([df_merge, dtm], axis=1, join='inner')

In [41]:
# check shape of merged dataframe
df_merge_vector.shape

(8312006, 306)

In [42]:
# check merged vectorized dataframe
df_merge_vector.tail()

Unnamed: 0,index,CONSIGNOR_NAME,CONSIGNOR_INN,PROCESSED_TEXT,index.1,00,10,11,27,848686,...,черновой,швейный,шина,шип,шлифовать,шт,электрический,элемент,этиловый,этиловый спирт
8312001,8312001,АО Н ВОЛЬВО ВОСТОК,5032048798,часть механизм сцепление новый дорожностроител...,8312001,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8312002,8312002,ООО ТЕХНОНИКОЛЬ - СТРОИТЕЛЬНЫЕ СИСТЕМЫ,7702521529,материал кровельный нефтяной битум рулон вес с...,8312002,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8312003,8312003,ООО АЛЬКОР,2526002854,первичный пилотажнонавигационный жидкокристалл...,8312003,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8312004,8312004,ООО СУПРАМЕД-ЮГ,6164210350,средство чистка зуб,8312004,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8312005,8312005,ОАО ГРУППА ИЛИМ,7840346335,целлюлоза сульфатный беленый лиственный порода...,8312005,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
# drop index columns
df_merge_vector = df_merge_vector.drop(columns=['index'])

In [44]:
df_merge_vector.head()

Unnamed: 0,CONSIGNOR_NAME,CONSIGNOR_INN,PROCESSED_TEXT,00,10,11,27,848686,88104см,90,...,черновой,швейный,шина,шип,шлифовать,шт,электрический,элемент,этиловый,этиловый спирт
0,АОЧЕРЕПОВЕЦКИЙ ФАНЕРНО-МЕБЕЛЬНЫЙ КОМБИНАТ,3528006408,пиломатериалыдоска еловый picea abies обрезная...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.360739,0.0,0.0,0.0,0.0,0.0,0.0
1,АО ПИКАЛЁВСКАЯ СОДА,4715022874,сульфат калий калий серонокислый технический к...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ООО ИНТЕР-ТРАНС,6324057625,швеллер бампер ваз 21900280301501 шт,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.6043,0.0,0.0,0.0,0.0
3,ООО ЗЕЛЕНЫЙ СВЕТ,3849029492,лесоматериал праспиливать вдольнестроганыенелу...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ОАО АВИАКОМПАНИЯ УРАЛЬСКИЕ АВИАЛИНИИ,6608003013,телефонный проводной трубка сбор связь бортпро...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.345262,0.0,0.0,0.0,0.0


In [45]:
# saves dataframe to S3
df_merge_vector.to_csv('s3://labs20-arms-bucket/data/df_train_description_filtered_vectorizedIF2.csv')

In [46]:
# Once the fit_transform was completed on the 'text' list, we pickled the vectorizer for use in data exploration and our final product.

# create s3, bucket, and key objects
# 'key' object is the file you want to save
s3 = boto3.resource('s3')
bucket=s3.Bucket('labs20-arms-bucket')
key = "vectorizerf.pkl"

# WRITE/SAVE 'vectorizer' to s3 bucket
with tempfile.TemporaryFile() as fp:
    joblib.dump(vectorizer, fp)
    fp.seek(0)
    bucket.put_object(Body=fp.read(), Key=key)

In [47]:
# test READ/LOAD of vectorizer from S3 bucket
with tempfile.TemporaryFile() as fp:
    bucket.download_fileobj(Fileobj=fp, Key=key)
    fp.seek(0)
    vectorizer_load = joblib.load(fp)

In [48]:
# Confirming vectorizer saved to S3 bucket is the same as vectorizer created in notebook
vectorizer_load

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.9, max_features=None,
                min_df=0.01, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [49]:
# Confirming vectorizer saved to S3 bucket is the same as vectorizer created in notebook
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.9, max_features=None,
                min_df=0.01, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)
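Comparing the two printed representations by eye only checks the constructor parameters. A programmatic check could also compare the fitted state. The sketch below does a local round trip through a temporary file instead of S3 and uses a small toy vectorizer; the variable names and toy texts are assumptions for illustration.

```python
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy vectorizer (hypothetical data) standing in for the fitted project vectorizer.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(
    ["уголь каменный", "сосна доска", "уголь бурый"]
)

# Same dump / seek / load pattern as above, without the S3 upload and download.
with tempfile.TemporaryFile() as fp:
    joblib.dump(toy_vectorizer, fp)
    fp.seek(0)
    reloaded = joblib.load(fp)

# Compare constructor parameters and the learned vocabulary and IDF weights.
same_params = reloaded.get_params() == toy_vectorizer.get_params()
same_vocabulary = reloaded.vocabulary_ == toy_vectorizer.vocabulary_
same_idf = np.allclose(reloaded.idf_, toy_vectorizer.idf_)

print(same_params, same_vocabulary, same_idf)
```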

In [None]:
# check word vector columns for effectiveness of stopword removal
for i in df_merge_vector.columns:
    print(i)

### NEW COLUMNS (with stopwords filtered out):
00
10
11
27
848686
88104см
90
946288
946388
abies
astm
betula
larix
larix sibirica
picea
picea abies
pinus
pinus sylvestris
sibirica
sibirica распиливать
sylvestris
sylvestris распиливать
автомобиль
арт
ассортимент
белый
береза
березовый
битуминозный
бревно
брутто
бумага
бытовой
ваз
верхний
верхом
вес
вес брутто
ветеринария
вещество
взрослый
влажность
внутренний
вода
военный
военный назначение
волокно
вязание
газ
год
горячекатаный
гост
гост 848686
гост 946288
гост 946388
гр
гражданский
грудь
грудь 88104см
дальнейший
двигатель
диаметр
дл
длина
добавка
добавление
доля
дополнение
доска
древесина
ель
емкость
железо
женский
женский швейный
жидкий
жир
задний
запасной
запасной часть
изготавливать
изготовление
изделие
иметь
иметь соединение
использоваться
какао
камаз
качество
класс
класс люкс
клееный
кожа
комплект
консервант
консервант сорт
контракт
кора
корпус
круглый
легковой
легковой автомобиль
лекарственный
лекарственный средство
лесоматериал
лесоматериал хвойный
лист
лиственница
лом
лущеный
люкс
мазут
максимальный
марка
мас
масло
массовый
массовый доля
материал
машина
машинный
машинный вязание
менее
металл
металлический
метод
метод astm
механический
мешок
минеральный
модель
моторный
мужской
назначение
напряжение
наркотический
наркотический сильнодействующий
небрусовать
нелегированный
необработанный
необработанный консервант
необтесанный
необтесанный нешлифованный
неокоренный
неокоренный необработанный
нестрагивать
нестроганый
нетто
нефтяной
нешлифованный
нешлифованный иметь
нить
новый
номинальный
номинальный длина
оборудование
обрабатывать
обработка
обрезной
обрезный
обтесывать
обувь
обхват
обхват грудь
общегражданский
общегражданский назначение
общий
объем
обыкновенный
обыкновенный pinus
одежда
ооо
основа
отсутствовать
перегоняться
пиловочник
пиломатериал
питание
пихта
пищевой
пластиковый
пластмасса
плита
плоский
плотность
пневматический
пневматический резиновый
поверхность
подошва
покрытие
полимерный
полимерный материал
порода
порода сосна
предельный
предназначать
представлять
применение
применяться
применяться ветеринария
припуск
продажа
продажа содержать
продукт
продукция
производство
прокат
пруток
пряжа
распиливать
расфасовывать
расфасовывать розничный
резина
резиновый
резиновый новый
ремонт
розничный
розничный продажа
рост
рулон
сахар
сбор
свежий
сгорание
сера
сечение
сибирский
сильнодействующий
синтетический
система
слой
смесь
согласно
содержание
содержание сера
содержать
содержать наркотический
соединение
соединение шип
сорт
сорт гост
сосна
сосна обыкновенный
состав
состоять
спирт
сплав
способ
средство
средство расфасовывать
сталь
стальной
стекло
стелька
сто
строительный
строительный цель
строительство
сухой
текстильный
температура
техника
технический
товар
толщина
топливо
топливо жидкий
торец
торцевой
торцевой соединение
трикотажный
трикотажный машинный
углерод
упак
упаковывать
употребление
урожай
устройство
учет
фанера
форма
хвойный
хвойный порода
химический
химический нить
хлопчатобумажный
хлопчатобумажный пряжа
цвет
цель
часть
часть ваз
черновой
швейный
шина
шип
шлифовать
шт
электрический
элемент
этиловый
этиловый спирт

### OLD COLUMNS (before stopword removal): 
0
1
1 шт
1.27
1.27 2011.10631
10
100
12
13
14
15
16
18
2
20
2011.10631
25
250
3
30
4
4 м
40
5
50
6
60
65
70
8
80
848686
85
88104
88104 см
946288
946388
abies
astm
betula
d
l
l распиливать
larix
larix sibirica
picea
picea abies
pinus
pinus sylvestris
sibirica
sylvestris
sylvestris l
автомобиль
алюминиевый
арт
ассортимент
б
белый
березовый
берёза
битуминозный
бревно
брутто
брутто поддон
бумага
бытовой
ваза
вдоль
вдоль нестрагивать
верхний
верхом
вес
вес брутто
ветеринария
вещество
взрослый
вид
влажность
внутренний
вода
военный
военный назначение
волокно
вулканизовать
вязание
г
газ
горячекатаный
гост
гост 848686
гост 946288
гост 946388
гр
гражданский
грудь
грудь 88104
далее
дальнейший
двигатель
диаметр
дл
длина
добавка
добавление
доля
дополнение
доска
древесина
ель
емкость
железо
женский
женский швейный
жидкий
жир
задний
запасной
запасной часть
изготавливать
изготовление
изделие
иметь
иметь соединение
использование
использоваться
какао
камаз
качество
кг
класс
класс люкс
клееный
кожа
количество
комплект
консервант
контракт
кора
коробка
корпус
круглый
л
легковой
легковой автомобиль
лекарственный
лекарственный средство
лесоматериал
лесоматериал хвойный
лист
лиственница
лом
лущеный
люкс
м
м ваза
м камаз
м2
м3
мазут
максимальный
марка
мас
масло
масса
массовый
массовый доля
материал
машина
машинный
машинный вязание
менее
металл
металлический
метод
метод astm
механический
мешок
минеральный
мм
модель
моторный
мужской
назначение
напряжение
наркотический
наркотический сильнодействующий
небрусовать
нелегированный
нелегированный становиться
необработанный
необработанный консервант
необтесанный
необтесанный нешлифованный
неокоренный
неокоренный необработанный
нестрагивать
нестроганый
нетто
нефтяной
нешлифованный
нешлифованный иметь
нить
новый
номинальный
номинальный длина
оборудование
обрабатывать
обработка
обрезной
обрезный
обтёсывать
обувь
обхват
обхват грудь
общегражданский
общегражданский назначение
общий
объем
обыкновенный
обыкновенный pinus
одежда
ооо
основа
отсутствовать
п
п сосна
перегоняться
пиловочник
пиломатериал
питание
пихта
пищевой
пластиковый
пластмасса
плита
плоский
плотность
пневматический
пневматический резиновый
поверхность
поддон
подошва
покрытие
полимерный
полимерный материал
получать
порода
порода сосна
предельный
предназначать
представлять
применение
применяться
применяться ветеринария
припуск
продажа
продажа содержимый
продукт
продукция
производство
прокат
прочий
прочий цель
пруток
пряжа
р
размер
распиливать
распиливать вдоль
расфасовывать
расфасовывать розничный
расфасовывать упаковка
резина
резиновый
резиновый новый
ремонт
розничный
розничный продажа
рост
рулон
сахар
сбор
сгорание
сера
сечение
сибирский
сильнодействующий
синтетический
система
слой
см
см дополнение
см обхват
смесь
согласно
содержание
содержание сера
содержимый
содержимый наркотический
соединение
соединение шип
сорт
сорт гост
сосна
сосна обыкновенный
состав
состоять
спирт
сплав
способ
средство
средство расфасовывать
сталь
стальной
становиться
стекло
стелька
сто
строительный
строительный цель
строительство
сухой
т
текстильный
температура
техника
технический
тип
товар
толщина
топливо
топливо жидкий
торец
торцевой
торцевой соединение
трикотажный
трикотажный машинный
углерод
упак
упаковка
упаковка розничный
упаковывать
устройство
учет
фанера
форма
фр
фр 1.27
х
х п
хвойный
хвойный порода
химический
химический нить
хлопчатобумажный
хлопчатобумажный пряжа
цвет
цель
часть
часть м
черновой
черный
черный металл
швейный
шина
шина пневматический
шип
шлифовать
шпон
шт
электрический
элемент
этиловый
этиловый спирт
являться
№
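The two column lists are long enough that differences are easy to miss when reading them side by side. A set difference makes the comparison explicit. The sketch below uses a handful of terms taken from the two lists above; in practice the full lists of column names would be passed in.

```python
# Small samples from the OLD and NEW column lists above.
old_columns = ["1", "кг", "масса", "сосна", "шт"]
new_columns = ["год", "сосна", "шт"]

# Terms that disappeared after stop word removal, and terms that newly appeared.
removed_terms = sorted(set(old_columns) - set(new_columns))
added_terms = sorted(set(new_columns) - set(old_columns))

print("removed:", removed_terms)
print("added:", added_terms)
```

Terms that show up as added are worth a look: removing stop words changes which words sit next to each other and how often each term clears the `min_df` threshold, so new columns can appear.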