# Notebook 4 - Text Processing/Predicting Function
The predictor function (`process_predictor_function`), which analyzes a dataframe's text column for similarities with known arms exporters, is housed in this notebook.  Throughout, markdown cells explain the steps and functionality contained in each portion of the notebook.

### PIP Installations
The following pip installations are required for the predictor function to work properly.

In [None]:
#pip installations - necessary to get notebook to run
#update dask
!pip install --upgrade pip
!pip install dask==2.4.8
!pip install fsspec
!pip install --upgrade s3fs
!pip install numpy
!pip install pymystem3
!pip install spacy
!pip install joblib
!pip install pymorphy2==0.8
!pip install dask_ml

### Imports
Imports from the first three notebooks, as well as the additional imports `euclidean` and `pdist`, are housed here.  `euclidean` and `pdist` are distance functions from `scipy.spatial.distance`.

In [2]:
# IMPORTS

# dataframe
import dask.dataframe as dd
import pandas as pd

# DESCRIPTION_GOOD preprocessing
import nltk
nltk.download("stopwords")
#--------#
from nltk.corpus import stopwords
from pymystem3 import Mystem
from string import punctuation


# machine learning/analysis
import dask_ml.cluster as dask_ml_model # sklearn's KMeans took up too much memory to run.

# measuring euclidian distance
from scipy.spatial.distance import euclidean, pdist

# S3 bucket interaction
import tempfile
import boto3
import joblib

# Disable warning message related to SettingWithCopyWarning
# displays when running final function otherwise
pd.options.mode.chained_assignment = None     # default = 'warn'

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.




### Functions, Classes, and Stopwords
Functions, classes, and stopwords from notebooks 2 and 3 are housed here.  The `similarity_func`, which is used to compute the function's final output, is also housed here.

In [3]:
# define stemmer and Russian stopwords for data preprocessing
mystem = Mystem() 
russian_stopwords = stopwords.words("russian")
# https://stackoverflow.com/questions/5511708/adding-words-to-nltk-stoplist
# add trade-specific stopwords to list
newStopWords = ['г', '№', '10', '1', '20', '30', 'кг', '5', 'см',
                '100', '80', '2', 'х', 'l', 'м', '00', '000',
                '1.27', '2011.10631', '4', '12', '3', 'фр', 'количество',
                'становиться', 'мм', 'вид', 'упаковка', 'получать',
                'прочий', 'использование', 'масса', 'размер', 'черный',
                '6', '8', '7', '50', '40', '25', 'коробка', 'поддон',
                'вдоль', '250', '65', '85', '15', '35', '40', '45',
                '55', '60', '70', '75', 'м3', '13', '0', '14',
                '16', '18', 'm2', 'п', 'р', 'т', 'тип', 'являться',
                'размер', 'cm', 'm', '01', '02', '03', '04', '05',
                '06', '07', '08', '09', '24', '27']
russian_stopwords.extend(newStopWords)

#define function for preprocessing text - to be used later in notebook
#function will remove Russian stop words and any punctuation not removed in cleaning_trade_data_desc_kmeans.ipynb
def preprocess_text(text):
    tokens = mystem.lemmatize(text.lower())
    tokens = [token for token in tokens if token not in russian_stopwords\
        and token != " " \
        and token.strip() not in punctuation]
    text = " ".join(tokens)
    return text

# similarity function for euclidian measure at end of main function
def similarity_func(u, v):
    return 1/(1+euclidean(u,v))

Installing mystem to /home/ec2-user/.local/bin/mystem from http://download.cdn.yandex.net/mystem/mystem-3.1-linux-64bit.tar.gz
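
As a quick illustration of what `similarity_func` returns, the sketch below evaluates it on a few toy vectors (the vectors are made up for illustration and are not taken from the trade data).  It also calls `pdist` with `similarity_func` as the metric, which is how the predictor function uses it; for two rows this should give the same value as calling `similarity_func` directly.

```python
from scipy.spatial.distance import euclidean, pdist


def similarity_func(u, v):
    # inverted Euclidean distance: 1 when u == v, approaching 0 as the distance grows
    return 1 / (1 + euclidean(u, v))


# identical vectors are at distance 0, so the similarity is 1
identical = similarity_func([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])

# vectors exactly one unit apart give 1 / (1 + 1)
unit_apart = similarity_func([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# pdist with a callable metric evaluates the callable on each pair of rows;
# with two rows there is a single pair, hence the [0]
via_pdist = pdist([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], similarity_func)[0]

print(identical, unit_apart, via_pdist)
```

Because the score is 1 / (1 + distance), it always lies between 0 and 1, and higher means more similar.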


### S3 Imports & Known Russian Arms Exporter Definition
For the function to work, the vectorizer, model, and listArmscluster (armsclustersf.csv) need to be loaded into the notebook from the S3 bucket.  Additionally, the list of known arms exporters needs to be copied over to the notebook.

In [4]:
#load model, vectorizer, and tokenizer to notebook
s3 = boto3.resource('s3')
bucket=s3.Bucket('labs20-arms-bucket')

# load vectorizer from S3 bucket
key = "vectorizerf.pkl"
with tempfile.TemporaryFile() as fp:
    bucket.download_fileobj(Fileobj=fp, Key=key)
    fp.seek(0)
    vectorizer = joblib.load(fp)

# load model from S3 bucket
key = "modelf.pkl"
with tempfile.TemporaryFile() as fp:
    bucket.download_fileobj(Fileobj=fp, Key=key)
    fp.seek(0)
    model = joblib.load(fp)

#load cluster dataset from S3 bucket
# drop error column accidentally created in import
clusters = pd.read_csv('s3://labs20-arms-bucket/data/armsclustersf.csv')
clusters = clusters.drop([clusters.columns[0]], axis='columns')

# list of known arms exporters
inn_arms_exp_total = ['7718852163',  '7740000090',    '7731084175',  '6161021690',
                      '3807002509',  '6672315362',    '7802375335',  '7813132895',  
                      '7731280660',  '7303026762',    '5040007594',  '2501002394',  
                      '7807343496',  '7731559044',    '5042126251',  '7731595540',    
                      '7733018650',  '7722016820',    '7705654132',  '7714336520',    
                      '7801074335',  '6229031754',    '7830002462',  '6825000757',  
                      '5043000212',  '7802375889',    '5010031470',  '1660249187',  
                      '7720015691',  '6154573235',    '5038087144',  '7713006304',  
                      '7805326230',  '5023002050',    '4007017378',  '7714013456',  
                      '17718852163', '7811406004',    '7702077840',  '7839395419',  
                      '7702244226',  '7704721192',    '7731644035',  '7712040285',
                      '7811144648',  '4345047310',    '7720066255',  '6607000556',
                      '1832090230',  '1835011597',    '3305004083',  '4340000830',
                      '5074051432',  '1841015504',    '7105008338',  '7106002829', 
                      '7704274402',  '5942400228',    '7105514574',  '5012039795', 
                      '7714733528',  '3904065550',    '6825000757',  '7807343496', 
                      '7731559044',  '7805231691',    '7704859803',  '0273008320',
                      '7704274402',  '2902059091',    '7805034277',  '7727692011',
                      '7733759899',  '6154028021',    '7328032711',  '2635002815',
                      '5040097816',  '5027033274',    '5250018433',  '5200000046',
                      '7743813961',  '7718016666',    '5047118550',  '7704274402']

In [5]:
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.9, max_features=None,
                min_df=0.01, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [6]:
model

KMeans(algorithm='full', copy_x=True, init='k-means||', init_max_iter=None,
       max_iter=300, n_clusters=10, n_jobs=1, oversampling_factor=2,
       precompute_distances='auto', random_state=None, tol=0.0001)

### Removing clusters from final analysis/product
Since clusters 2, 3, 4, 5, 7, 8, and 9 have little representation in the known arms exporter dataset (combined, they account for less than 3% of total trades), we are removing them from the final similarity calculation.  The first iteration of our product assigned similarity scores that were too high to INNs that had low numbers of trades in these clusters.  This was incorrect, and removing the clusters yielded much better final results.

In other words, similarity will only be measured against clusters 0, 1, and 6.

Removed clusters can be adjusted via the function's `cluster_columns` parameter.

In [7]:
clusters

Unnamed: 0,CONSIGNOR_INN,clust0,clust1,clust2,clust3,clust4,clust5,clust6,clust7,clust8,clust9
0,known_AE_trade_count,55151,9961,81,42,13,23,8723,1356,348,0
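
The share of known arms exporter trades that falls into the removed clusters can be checked directly from the counts in the `clusters` output above.  This is a minimal sketch that retypes those counts into a plain dictionary rather than reading `clusters` from S3.

```python
# trade counts per cluster for known arms exporters, copied from the clusters output
known_counts = {
    "clust0": 55151, "clust1": 9961, "clust2": 81, "clust3": 42, "clust4": 13,
    "clust5": 23, "clust6": 8723, "clust7": 1356, "clust8": 348, "clust9": 0,
}

kept = ["clust0", "clust1", "clust6"]
removed = [name for name in known_counts if name not in kept]

total_trades = sum(known_counts.values())
removed_trades = sum(known_counts[name] for name in removed)
removed_share = removed_trades / total_trades

print("removed clusters:", removed)
print("share of known arms exporter trades in removed clusters: {:.2%}".format(removed_share))
```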


In [8]:
### TEST ###
#read df_trade_desc_processed_testIF2 for function test
df = pd.read_csv('s3://labs20-arms-bucket/data/df_trade_desc_processed_testIF2.csv',dtype={'CONSIGNOR_INN': 'str'})

In [87]:
# check dataframe size
# running test on dataframe containing 2,078,002 rows
df.shape

(2078002, 5)

In [89]:
# test dataframe contains 32,185 unique INNS
df['CONSIGNOR_INN'].nunique()

32185

In [11]:
# check dataframe
df.head()

Unnamed: 0,CONSIGNOR_NAME,DECLARATION_NUMBER,CONSIGNOR_INN,DESCRIPTION_GOOD
0,ООО МАГНАТ,10607040/220817/0014872,3808198484,ЛЕСОМАТЕРИАЛЫРАСПИЛЕННЫЕ ВДОЛЬ Х/П ЛИСТВЕННИЦА...
1,ООО ТК ВЕСТА,10611020/250918/0025950,2465310231,ПИЛОМАТЕРИАЛЫ Х/П ЕЛЬ СИБИРСКАЯ PICEA OBOVATA ...
2,ООО ТЕХНОНИКОЛЬ - СТРОИТЕЛЬНЫЕ СИСТЕМЫ,10103110/060516/0009957,7702521529,ТЕПЛОИЗОЛЯЦИОННЫЕ ПЛИТЫ ПОРИСТЫЕ ИЗ ЭКСТРУЗИОН...
3,ЗАО ЭНЕРГОСТРОЙМОНТАЖ,10210350/170317/0004957,7813112708,КЛАПАНЫ ОБРАТНЫЕ ПОВОРОТНЫЕ ОДНОДИСКОВЫЕ ИЗГОТ...
4,ООО КУПИШУЗ,10113110/040418/0043451,7705935687,РУБАШКА МУЖСКАЯ ШЕРСТЯНАЯ ТРИКОТАЖНАЯ НЕ КЛАСС...


### Predictor Function
The predictor function takes a Russian trade dataframe as an input and returns a dataframe of INNs whose trade descriptions (the `DESCRIPTION_GOOD` column by default) are similar to those of known arms exporters.  The similarity score is the inverted Euclidean distance, 1 / (1 + d), computed with scipy's `euclidean` and `pdist`, and it measures how similar two sets of numbers are.

In short, the function compares the share of total trades per cluster for known arms exporters against the same shares for every INN in the input dataframe, and if the similarity between the two sets of ratios is at or above a certain threshold, that INN makes it into the final dataframe.  For example, if INN 'A' has 90% similar ratios to known arms exporters and the threshold is 65%, INN 'A' will be displayed in the function's final output.  If INN 'B' has 40% similar ratios to known arms exporters and the threshold is 65%, INN 'B' will not be displayed in the function's final output.
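
The comparison can be worked through by hand.  The sketch below rebuilds the known arms exporter ratios from the `clusters` counts, then scores two profiles against them: INN 'A' uses the clust0/clust1/clust6 ratios reported for INN 6164021988 in the test output further down, and INN 'B' is a hypothetical profile invented for illustration.  It uses `numpy.linalg.norm` for the Euclidean distance instead of scipy, which is an assumption made to keep the example small; the formula is the same 1 / (1 + distance).

```python
import numpy as np

# trade counts per cluster for known arms exporters, copied from the clusters output
known_counts = {
    "clust0": 55151, "clust1": 9961, "clust2": 81, "clust3": 42, "clust4": 13,
    "clust5": 23, "clust6": 8723, "clust7": 1356, "clust8": 348, "clust9": 0,
}
cluster_columns = ["clust0", "clust1", "clust6"]

# ratios are taken over ALL ten clusters first, then only the clusters of interest are kept
total = sum(known_counts.values())
known_profile = np.array([known_counts[c] / total for c in cluster_columns])


def similarity(u, v):
    # inverted Euclidean distance, same formula as similarity_func
    return 1 / (1 + np.linalg.norm(np.asarray(u) - np.asarray(v)))


threshold = 0.65
inn_a = [0.692308, 0.137652, 0.072874]  # ratios reported for INN 6164021988
inn_b = [0.10, 0.10, 0.70]              # hypothetical profile dominated by clust6

score_a = similarity(inn_a, known_profile)
score_b = similarity(inn_b, known_profile)

print("A:", round(score_a, 6), "kept" if score_a >= threshold else "dropped")
print("B:", round(score_b, 6), "kept" if score_b >= threshold else "dropped")
```

The score for 'A' should agree with the `pdist_score` shown for that INN in the test output, up to the rounding of the displayed ratios.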

#### Inputs:
**df (required)** = input dataframe.  df must be a pandas dataframe containing Russian trade data.  Must contain a company name column, a company identifier column, and a trade description column

**name_column (default = 'CONSIGNOR_NAME')** = name column to be used in function. Default is 'CONSIGNOR_NAME' because model was trained on dataset whose name column was 'CONSIGNOR_NAME'

**id_column (default = 'CONSIGNOR_INN')** = id column to be used in function. Default is 'CONSIGNOR_INN' because model was trained on dataset whose id column was 'CONSIGNOR_INN'

**text_column (default = 'DESCRIPTION_GOOD')** = text column to be used in function. Default is 'DESCRIPTION_GOOD' because model was trained on dataset whose text column was 'DESCRIPTION_GOOD'

**invalid_id_terms (default = ['None', '00', 'ИНН/КПП НЕ О', '0'])** = invalid terms present in id column.  Will be used to clean dataset entries with invalid id numbers

**min_trades (default = 35)** = minimum trades per INN to be considered by the function.  If a particular INN only has one or two trades in the input dataframe, the vectorized/predicted results might not be as accurate.  Typically the higher the min_trades number, the fewer entries there will be in the output dataframe

**profile_similarity_threshold (default = .65)** = threshold for a given INN to appear in output dataframe.  If an INN's similarity score is below this threshold, it will not appear in the output dataframe.  If an INN's similarity score is above this threshold, it will appear in the output dataframe

**cluster_columns (default = ['clust0', 'clust1', 'clust6'])** = clusters of interest identified by the KMeans model.  A way of filtering out clusters that had little to no representation amongst known arms exporters, which makes for a more accurate model.  Do not update unless the `clusters` variable is also updated
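
To make the effect of `min_trades` concrete, here is a small sketch of the same `groupby(...).transform('size')` filter applied to a toy dataframe.  The INNs, descriptions, and the threshold of 2 are invented for illustration; the real default is 35.

```python
import pandas as pd

# toy trade data: INN 111 has three trades, 222 has one, 333 has two
toy = pd.DataFrame({
    "CONSIGNOR_INN": ["111", "111", "111", "222", "333", "333"],
    "DESCRIPTION_GOOD": ["a", "b", "c", "d", "e", "f"],
})

min_trades = 2  # toy value, chosen so that exactly one INN is filtered out

# transform('size') repeats each INN's trade count on every one of its rows,
# which lets the count be used directly as a row filter
trade_counts = toy.groupby("CONSIGNOR_INN")["CONSIGNOR_INN"].transform("size")
filtered = toy[trade_counts >= min_trades]

kept_inns = sorted(filtered["CONSIGNOR_INN"].unique())
print(kept_inns)
```

Raising `min_trades` removes more INNs before any text is preprocessed, which also shortens the run time of the function.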

In [57]:
def process_predictor_function(df, name_column = 'CONSIGNOR_NAME', id_column = 'CONSIGNOR_INN', text_column = 'DESCRIPTION_GOOD',
                              invalid_id_terms = ['None', '00', 'ИНН/КПП НЕ О', '0'], min_trades=35, profile_similarity_threshold = .65,
                              cluster_columns = ['clust0', 'clust1', 'clust6']):
    """
    function to clean INNs of input dataframe and return Russian arms exporter similarity calculation
    
    """
    try:
        # reduce dataframe so that it only contains the name, id, and text columns
        df = df[[name_column, id_column, text_column]]
        
        # remove rows from dataset containing INNs of known arms exporters
        # check 'INN' column against inn_arms_exp_total list, drop row if there's a match with the list
        df = df[~df[id_column].isin(inn_arms_exp_total)]
        
        # clean INNs
        # Create subslice of dataframe for dictionary
        dict_df = df[[name_column, id_column]]
        # clean columns of dict_df, remove rows whose id is one of the invalid_id_terms
        dict_df = dict_df[~dict_df[id_column].isin(invalid_id_terms)]
        # drop all null values
        dict_df.dropna(inplace=True)
        # sort values by 'CONSIGNOR_NAME'
        dict_df.sort_values(name_column, inplace = True) 
        # drop duplicate 'CONSIGNOR_NAME' values from dictionary, keeping the first occurrence of each name
        dict_df.drop_duplicates(subset =name_column, keep = 'first', inplace = True) 
        # create list of 2-item lists: [CONSIGNOR_NAME, CONSIGNOR_INN]
        new_list = dict_df.values.tolist()
        # create dictionary out of list of lists
        # for every list in the list of lists, take the first item in list (CONSIGNOR_NAME)
        # and add it to index position of dictionary, take second term ('CONSIGNOR_INN') and add it to value position of dictionary
        # cannot use pandas.to_dict() because it adds column names to dictionary; only want indexes/values
        new_dict = {t[0]:t[1] for t in new_list}
        # map new_dict to 'CONSIGNOR_INN' column of main dataframe
        df[id_column] = df[name_column].map(new_dict)

        # drop null values
        df.dropna(inplace=True)
        
        # remove all rows from list whose total INN count is less than min_trades variable
        # way to limit size before processing, weed out INNs that only have a few trades present in dataset
        df = df[df.groupby(id_column)[id_column].transform('size') >= min_trades]
        
        #create list for preprocessed text to be appended to
        processed_text_list = []
        
        #this is the alg to apply preprocessing function to text column
        # removed print statement from David's function
        for i in range(len(df[text_column])):
            x = df[text_column].iloc[i]
            if isinstance(x, str):
                processed_text_list.append(preprocess_text(x))
            else:
                # str() works for any non-string value; plain Python objects have no .astype()
                processed_text_list.append(preprocess_text(str(x)))
            
        # convert list of preprocessed text to dataframe
        # to be concatenated onto original dataframe
        df1 = pd.DataFrame({'PREPROCESSED_TEXT':processed_text_list})
        
        # reset indices of both dataframes
        df1 = df1.reset_index()
        df = df.reset_index()
        df['index'] = df.index
        
        # merge preprocessed text to original dataframe
        df_merge = pd.concat([df, df1], axis=1, join='inner')
        
        # drop DESCRIPTION_GOOD column, no longer necessary now that PREPROCESSED_TEXT column is present
        df_merge = df_merge.drop([text_column, 'index'], axis='columns')
        
        #define variable to feed to TFIDF Vectorizer - 'PREPROCESSED_TEXT' column of input dataset
        text = df_merge['PREPROCESSED_TEXT']
        
        #transform text with vectorizer
        #Cast to unicode strings so that any np.nan values do not raise an error in the vectorizer.
        sparse = vectorizer.transform(text.values.astype('U'))
        
        # Get feature names to use as dataframe column headers
        dtm = pd.DataFrame(sparse.todense(), columns=vectorizer.get_feature_names())
        
        # reset indices of both dataframes for merge
        # not sure why we had to do this, but running the following three commands gave us the results we wanted
        dtm = dtm.reset_index()
        df_merge = df_merge.reset_index()
        df_merge['index'] = df_merge.index
        dtm['index'] = dtm.index
        
        # merge vectorized word feature matrix with training dataset
        df_merge_vector = pd.concat([df_merge, dtm], axis=1, join='inner')
        # drop index columns
        df_merge_vector = df_merge_vector.drop(columns=['index'])
        
        # variable manipulation to feed into KMeans model
        # create variable containing dataframe of vectorized words only (all rows, every column except the name, id, and text columns)
        X = df_merge_vector.drop(columns=[name_column, id_column, 'PREPROCESSED_TEXT'])
        
        # convert X dataframe into array
        # necessary to feed to KMeans model
        X_array = X.values
        
        # predict cluster labels for the vectorized word array (the model is already fitted)
        labels = model.predict(X_array)
        
        # create 'cluster' column to add to vectorized dataframe
        # glue back to original data
        df_merge_vector['cluster'] = labels

        # extract columns for final analysis
        Y = df_merge_vector[[id_column,'cluster']]
        
        # add column to dataframe for each cluster in model, created with copied values from 'cluster' column
        # create 1,0 boolean to check if number in cell is equal to number of cluster, assigns 1s and 0s accordingly
        # drop cluster column, no longer necessary now that we have count
        for i in range(model.n_clusters):
            Y['clust{}'.format(i)] = Y['cluster']
            Y['clust{}'.format(i)] = (Y['clust{}'.format(i)] == i) * 1
        
        # drop 'cluster' column, no longer necessary now that we have total trades per cluster per INN
        Y = Y.drop(columns=['cluster'])
        
        #create column_names variable to filter out CONSIGNOR_INN from .groupby() in next step
        column_names = Y.drop(columns = [id_column]).columns.tolist()
        
        #create new dataframe totalling trades per cluster per INN
        Y = pd.DataFrame(Y.groupby([Y[id_column]])[column_names].sum()).reset_index()
        
        # add final tally for known arms exporters
        # reset index so known arms exporters are at bottom of dataframe, indexed properly
        Y = Y.append(clusters.iloc[0,1:], sort=None).reset_index().drop(columns=['index'])
        
        # convert all columns except for 'CONSIGNOR_INN' to decimals/percentages of total
        Y[column_names] = Y[column_names].div(Y[column_names].sum(axis=1), axis=0)
        
        # cluster columns
        # remove clusters with low percentages for known arms exporters from dataset
        # build a new list instead of inserting into cluster_columns, so the default argument is not modified between calls
        Y = Y[[id_column] + list(cluster_columns)]
        
        # similarity scores - create list of scores using pdist with the inverted Euclidean distance (similarity_func)
        # simply put, it measures how similar two sets of numbers are
        # https://stackoverflow.com/questions/35758612/most-efficient-way-to-construct-similarity-matrix
        # each row in dataframe will be compared against the bottom row of the dataframe, which contains the totals for known arms exporters
        pscores=[]
        for i in range(len(Y)):
            x = pdist([Y.iloc[-1, 1:],Y.iloc[i, 1:]], similarity_func)[0]
            pscores.append(x)
        
        # add pdist_score column to Y dataframe
        Y['pdist_score'] = pscores
        
        # drop control row (known arms exporters totals)
        Y = Y.drop(Y.index[-1])
        
        # create profile_similarity_threshold variable
        # if INN's pdist_score >= profile_similarity_threshold, INN will be included in final dataframe
        # if INN's pdist_score < profile_similarity_threshold, INN will not be included in final dataframe
        Y = Y[Y['pdist_score'] >= profile_similarity_threshold]
        
        #generate dataframe
        return Y
        
    except:
        
        raise
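
The tallying step in the middle of the function (one 0/1 column per cluster, summed per INN, then divided by the row total) can be illustrated on toy data.  The sketch below uses `pd.crosstab` as a compact alternative, not the approach the function itself takes; the INNs and cluster labels are invented for illustration.

```python
import pandas as pd

n_clusters = 10  # matches the n_clusters of the loaded KMeans model

# toy output of model.predict(): one predicted cluster label per trade
labelled = pd.DataFrame({
    "CONSIGNOR_INN": ["111", "111", "111", "111", "222", "222"],
    "cluster": [0, 0, 1, 6, 0, 9],
})

# count trades per cluster per INN
counts = pd.crosstab(labelled["CONSIGNOR_INN"], labelled["cluster"])

# make sure every cluster has a column, even if no toy trade landed in it
counts = counts.reindex(columns=range(n_clusters), fill_value=0)
counts.columns = ["clust{}".format(i) for i in counts.columns]

# convert counts to each INN's share of trades per cluster
ratios = counts.div(counts.sum(axis=1), axis=0)

# keep only the clusters of interest; shares are still relative to ALL clusters
profile = ratios[["clust0", "clust1", "clust6"]]
print(profile)
```

Note that the kept columns need not sum to 1: INN 222 has half of its trades in a removed cluster, so its profile sums to 0.5, which increases its distance from the known arms exporter profile.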

### Test
Tested function on a dataframe containing roughly 2 million rows (2,078,002) of trade data for 32,185 unique INNs.  Our function successfully analyzed these INNs and, based on the default profile similarity threshold of 65%, reduced the dataset to 3,777 INNs of interest.  The text patterns of their trades' descriptions fall into a similar clustering pattern as known Russian arms exporters according to our model.

In [66]:
### TEST ###
# test = process_predictor_function(df)
test.sort_values(by='pdist_score', ascending=True).tail(25)

Unnamed: 0,CONSIGNOR_INN,clust0,clust1,clust6,pdist_score
1275,3123138830,0.775862,0.086207,0.137931,0.935134
2941,610210391297,0.726829,0.063415,0.121951,0.935869
2951,6119005430,0.723735,0.07393,0.14786,0.937712
3285,6166093748,0.773333,0.133333,0.066667,0.93802
3332,6167129059,0.666667,0.115226,0.123457,0.939363
5093,7814413641,0.755682,0.079545,0.142045,0.939392
5180,7839447850,0.761468,0.082569,0.091743,0.940256
2996,6145001168,0.78125,0.109375,0.109375,0.945649
2496,5075018950,0.694074,0.175451,0.12134,0.94685
3190,6164021988,0.692308,0.137652,0.072874,0.946891


In [59]:
test.shape

(3777, 5)

In [60]:
test.nunique()

CONSIGNOR_INN    3777
clust0           2256
clust1           1170
clust6           1406
pdist_score      2963
dtype: int64

### Conclusions & Recommendations for Future Groups
**Conclusions:**
- too much weight is given to cluster0.  Almost half of the trade text in the training dataset was grouped into cluster0.  This skews the similarity score as well, since the majority of known arms exporter trades fall into cluster0
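
One way to see the cluster0 skew is to score a hypothetical INN whose trades all fall into cluster0.  The sketch below is an illustration under that assumption (the profile is invented, and `numpy.linalg.norm` stands in for scipy's `euclidean`); the known arms exporter ratios are rebuilt from the `clusters` counts.

```python
import numpy as np

# trade counts per cluster for known arms exporters, copied from the clusters output
known_counts = {
    "clust0": 55151, "clust1": 9961, "clust2": 81, "clust3": 42, "clust4": 13,
    "clust5": 23, "clust6": 8723, "clust7": 1356, "clust8": 348, "clust9": 0,
}
total = sum(known_counts.values())
known_profile = np.array(
    [known_counts[c] / total for c in ["clust0", "clust1", "clust6"]]
)

# hypothetical INN: every trade was assigned to cluster0
all_cluster0 = np.array([1.0, 0.0, 0.0])

# inverted Euclidean distance, same formula as similarity_func
score = 1 / (1 + np.linalg.norm(all_cluster0 - known_profile))

print("score:", round(score, 4), "| passes 0.65 threshold:", score >= 0.65)
```

Under these assumptions, an INN with nothing but cluster0 trades still clears the default threshold, which is consistent with the concern that cluster0 dominates the score.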

**Recommendations:**
- expand list of known arms exporters
- add list of known non-arms exporters, will reduce CPU requirements of running function and improve results
- explore hyperparameter tuning of KMeans model
- explore other potential clustering/machine learning models
- incorporate a larger vectorizer into the model. 301 columns does not fully capture the descriptions of 8 million trades
- incorporate batch processing into model/vectorizer training.  Will allow for more accurate model, and alleviate memory issues Labs20 group encountered
- incorporate more data!  Our model only focused on `NAME`, `ID`, and `TEXT` columns.  Analyzing more columns from the original dataset should yield a more accurate model