# Notebook 3 - KMeans Clustering & Exploratory Analysis
Once vectorization is complete, a KMeans clustering model is used to group similar bodies of text together. Once the trades were grouped, we totaled the number of trades per cluster per INN (Russian tax ID number). Next, we totaled the number of trades per cluster for all known Russian arms exporters and converted those totals to ratios, where each ratio represents the percentage of trades that fall into that cluster.

We are building a 'profile' for known Russian arms exporters. Based on analysis of the text content of individual trades, the trades of known Russian arms exporters can be assigned to clusters. Our final product will compare new trade data against this known arms exporter profile. It aims to answer two questions: how many companies trade similar products, or trade in a similar manner, as known arms exporters, and how similar are they?

The final product assigns a 'similarity score' to each INN in the dataset using a measure of inverse Euclidean distance. In simpler terms, it checks how 'similar' each INN's set of cluster percentages is to the percentages for all known Russian arms exporters.
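
The scoring idea can be sketched in a few lines. This is only an illustration under stated assumptions: the notebook does not give the exact formula, so the `1 / (1 + d)` form, the helper names `to_ratios` and `similarity_score`, and the `candidate` counts are all hypothetical. The `profile` counts are the known-exporter totals per cluster computed later in this notebook.

```python
import math

def to_ratios(counts):
    """Convert per-cluster trade counts into fractions that sum to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]

def similarity_score(inn_counts, profile_counts):
    """Inverse Euclidean distance between two cluster-ratio vectors.

    Assumed form: 1 / (1 + d), so identical ratio vectors score 1.0
    and the score shrinks toward 0 as the distance d grows.
    """
    a = to_ratios(inn_counts)
    b = to_ratios(profile_counts)
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

# Known arms exporter trade counts for clust0..clust9 (from this notebook)
profile = [55151, 9961, 81, 42, 13, 23, 8723, 1356, 348, 0]
# Hypothetical per-cluster trade counts for a single INN
candidate = [120, 30, 0, 0, 0, 1, 25, 4, 0, 0]

score = similarity_score(candidate, profile)
identical = similarity_score(profile, profile)
print(round(score, 3), identical)
```

Working on ratios rather than raw counts keeps a small company comparable to a large one, since only the shape of the distribution across clusters matters.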

In [None]:
# required pip installations
!pip install --upgrade pip
!pip install joblib
!pip install --upgrade s3fs
!pip install googletrans

# for dask machine learning cluster
!pip install dask-ml
!pip install executor

In [None]:
# IMPORTS

# dataframe
import dask.dataframe as dd
import pandas as pd

# machine learning/analysis
#from sklearn.cluster import KMeans
import dask_ml.cluster as dask_ml_model # sklearn's KMeans took up too much memory to run.
from sklearn.model_selection import train_test_split

# saving model to S3 bucket
import tempfile
import boto3
import joblib

# translate results
from googletrans import Translator
#import executor as e

In [3]:
# read df_trade_desc_filt_vector for nlp
df = dd.read_csv('s3://labs20-arms-bucket/data/df_train_description_filtered_vectorizedIF2.csv',dtype={'CONSIGNOR_INN': 'object'})
#                 dtype={'CONSIGNOR_INN': 'str', 'Unnamed: 0': 'str'}, usecols=range(0, 312))

In [4]:
df = df.drop(columns=['Unnamed: 0'])
# remove all ',' from CONSIGNOR_INN column
# somehow was missed in REGEX filter from cleaning_trade_data notebook
#df['CONSIGNOR_INN'] = df['CONSIGNOR_INN'].str.replace(',', '')

In [5]:
df.head()

Unnamed: 0,CONSIGNOR_NAME,CONSIGNOR_INN,PROCESSED_TEXT,00,10,11,27,848686,88104см,90,...,черновой,швейный,шина,шип,шлифовать,шт,электрический,элемент,этиловый,этиловый спирт
0,АОЧЕРЕПОВЕЦКИЙ ФАНЕРНО-МЕБЕЛЬНЫЙ КОМБИНАТ,3528006408,пиломатериалыдоска еловый picea abies обрезная...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.360739,0.0,0.0,0.0,0.0,0.0,0.0
1,АО ПИКАЛЁВСКАЯ СОДА,4715022874,сульфат калий калий серонокислый технический к...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ООО ИНТЕР-ТРАНС,6324057625,швеллер бампер ваз 21900280301501 шт,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.6043,0.0,0.0,0.0,0.0
3,ООО ЗЕЛЕНЫЙ СВЕТ,3849029492,лесоматериал праспиливать вдольнестроганыенелу...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ОАО АВИАКОМПАНИЯ УРАЛЬСКИЕ АВИАЛИНИИ,6608003013,телефонный проводной трубка сбор связь бортпро...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.345262,0.0,0.0,0.0,0.0


In [6]:
# read dataframe to memory
# allows pandas operations to be performed
df = df.compute()

In [7]:
type(df)

pandas.core.frame.DataFrame

### Feeding Data Into the Model
Similar to the TfidfVectorizer, our dask_ml_model.KMeans model object must be trained on an array. In our initial analysis we did not perform hyperparameter tuning, and selected 10 clusters as a default. Once the array was created, we fit the model on it, extracted model.labels_ to use as our cluster assignments, and added the results to our dataframe.

In [10]:
# variable manipulation to feed into KMeans model
# create variable containing the vectorized word columns only: all rows, columns at index 3 and onward
X = df.iloc[:, 3:]

In [11]:
# check dataframe to confirm its columns only contain word vectors
X.head()

Unnamed: 0,00,10,11,27,848686,88104см,90,946288,946388,abies,...,черновой,швейный,шина,шип,шлифовать,шт,электрический,элемент,этиловый,этиловый спирт
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.433472,...,0.0,0.0,0.0,0.360739,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.6043,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.345262,0.0,0.0,0.0,0.0


In [12]:
# check the shape: number of rows (trades) and number of word-vector columns
X.shape

(8312006, 301)

In [13]:
# convert X dataframe into array
# necessary to feed to KMeans model
X_array = X.values

### KMeans Model Selection
Because sklearn's KMeans used too much memory, the Labs20 group decided to use the KMeans implementation from the dask_ml.cluster module (imported here as dask_ml_model). Its default initializer is `k-means||`, compared to `k-means++` in scikit-learn. `k-means||` is designed to work well in a distributed environment such as SageMaker, whereas `k-means++` reads everything into memory at once and can cause errors.

In [14]:
# define KMeans model
# n_jobs is not set here, so the model keeps its default (n_jobs=1 in the output below)
model = dask_ml_model.KMeans(n_clusters=10)

In [15]:
# fit model on vectorized word array
model.fit(X_array)

KMeans(algorithm='full', copy_x=True, init='k-means||', init_max_iter=None,
       max_iter=300, n_clusters=10, n_jobs=1, oversampling_factor=2,
       precompute_distances='auto', random_state=None, tol=0.0001)

In [16]:
# Once the model was trained on our array of word vectors, we pickled it to our S3 bucket for use in our final product.
s3 = boto3.resource('s3')
bucket=s3.Bucket('labs20-arms-bucket')
key = "modelf.pkl"

# WRITE/SAVE 'model' to s3 bucket
with tempfile.TemporaryFile() as fp:
    joblib.dump(model, fp, compress=3)
    fp.seek(0)
    bucket.put_object(Body=fp.read(), Key=key)

In [17]:
# test READ/LOAD of model from S3 bucket
with tempfile.TemporaryFile() as fp:
    bucket.download_fileobj(Fileobj=fp, Key=key)
    fp.seek(0)
    model_load = joblib.load(fp)

In [18]:
# Confirming model saved to S3 bucket is the same as model created in notebook
model_load

KMeans(algorithm='full', copy_x=True, init='k-means||', init_max_iter=None,
       max_iter=300, n_clusters=10, n_jobs=1, oversampling_factor=2,
       precompute_distances='auto', random_state=None, tol=0.0001)

In [19]:
# Compare against the model created in the notebook: its parameters should match the loaded model above
model

KMeans(algorithm='full', copy_x=True, init='k-means||', init_max_iter=None,
       max_iter=300, n_clusters=10, n_jobs=1, oversampling_factor=2,
       precompute_distances='auto', random_state=None, tol=0.0001)

In [20]:
# Define model labels in variable 'labels'
labels = model.labels_

# Attach the cluster labels to the original data
df['cluster'] = labels

# reduce dataframe to necessary columns; the word-vector columns are no longer needed now that we have clusters
df = df[['CONSIGNOR_NAME', 'CONSIGNOR_INN', 'PROCESSED_TEXT', 'cluster']]
df.head()

Unnamed: 0,CONSIGNOR_NAME,CONSIGNOR_INN,PROCESSED_TEXT,cluster
0,АОЧЕРЕПОВЕЦКИЙ ФАНЕРНО-МЕБЕЛЬНЫЙ КОМБИНАТ,3528006408,пиломатериалыдоска еловый picea abies обрезная...,5
1,АО ПИКАЛЁВСКАЯ СОДА,4715022874,сульфат калий калий серонокислый технический к...,1
2,ООО ИНТЕР-ТРАНС,6324057625,швеллер бампер ваз 21900280301501 шт,0
3,ООО ЗЕЛЕНЫЙ СВЕТ,3849029492,лесоматериал праспиливать вдольнестроганыенелу...,5
4,ОАО АВИАКОМПАНИЯ УРАЛЬСКИЕ АВИАЛИНИИ,6608003013,телефонный проводной трубка сбор связь бортпро...,0


In [None]:
# Export dataframe with cluster assignments to S3 Bucket
df.to_csv('s3://labs20-arms-bucket/data/df_train_with_clustersIF2.csv')

### Initial Analysis
Once our model was saved, we recreated the 'tokenize' function and loaded the pickled vectorizer to view the results of our clustering. For some reason we could not load the pickled 'tokenize' function from our S3 bucket, so we had to recreate it each time. Because the tokenize function was used in our vectorizer, it must be defined in the notebook's memory before the vectorizer is loaded from the S3 bucket; otherwise loading raises an error.

To see how our model/vectorizer performed, we created a for loop that printed the top 15 terms per cluster and translated them to English. To our delight, the clusters seemed to organize the trade descriptions around specific industries; clusters 0, 6, and 7 even displayed words strongly associated with arms exports and heavy machine building! This suggests that our model grouped similar trades sensibly.
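
The ranking step behind that loop can be tried offline, without the model, the vectorizer, or the translation service. The sketch below uses a made-up centroid matrix and a made-up vocabulary (`centers` and `terms` are hypothetical stand-ins for `model.cluster_centers_` and the vectorizer's feature names) to show how `argsort` plus a column reversal yields the heaviest terms per cluster.

```python
import numpy as np

# Hypothetical centroids: 2 clusters x 4 vocabulary terms
centers = np.array([[0.1, 0.7, 0.0, 0.2],
                    [0.5, 0.0, 0.4, 0.1]])
# Hypothetical vocabulary, in the same column order as the centroids
terms = ["steel", "pcs", "alcohol", "metal"]

# argsort orders each row ascending; reversing the columns puts the
# indices of the largest centroid weights first
order = centers.argsort()[:, ::-1]

# Keep the two heaviest terms per cluster
top_terms = [[terms[j] for j in row[:2]] for row in order]
for i, names in enumerate(top_terms):
    print("Cluster %d:" % i, names)
```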

In [None]:
# create S3 object
s3 = boto3.resource('s3')
bucket=s3.Bucket('labs20-arms-bucket')

In [21]:
# load vectorizer for analysis
key = "vectorizerf.pkl"
# test READ/LOAD of vectorizer from S3 bucket
with tempfile.TemporaryFile() as fp:
    bucket.download_fileobj(Fileobj=fp, Key=key)
    fp.seek(0)
    vectorizer = joblib.load(fp)

In [24]:
# Avoid re-running this cell many times: the Google Translate API may start failing with
# 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'
# This for loop prints the top 15 terms per cluster and translates them to English
translator = Translator()
print("Top terms per cluster:")
# sort term indices of each centroid by weight, largest first
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(model.n_clusters):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :15]:
        print("Russian:",' %s' % terms[ind], ", English:", translator.translate('{}'.format(terms[ind])).text)

Top terms per cluster:
Cluster 0:
Russian:  шт , English: PCS
Russian:  часть , English: part
Russian:  ваз , English: vases
Russian:  изделие , English: product
Russian:  марка , English: mark
Russian:  гост , English: guest
Russian:  материал , English: material
Russian:  назначение , English: appointment
Russian:  сталь , English: steel
Russian:  вес , English: weight
Russian:  новый , English: new
Russian:  длина , English: length
Russian:  рулон , English: roll
Russian:  металл , English: metal
Russian:  диаметр , English: diameter
Cluster 1:
Russian:  содержать , English: contain
Russian:  спирт , English: alcohol
Russian:  какао , English: cocoa
Russian:  этиловый , English: ethyl
Russian:  сахар , English: sugar
Russian:  этиловый спирт , English: ethanol
Russian:  содержание , English: content
Russian:  вещество , English: substance
Russian:  добавка , English: supplement
Russian:  средство , English: means
Russian:  изделие , English: product
Russian:  жир , English: fat
Russ

In [26]:
# Create listAllcluster list to count the number of trade text corpuses falling into each cluster for all INN numbers.
listAllcluster = df.cluster.value_counts()
listAllcluster

0    4933080
5     845003
6     524934
1     459900
3     450046
4     384065
2     267521
8     176743
9     141009
7     129705
Name: cluster, dtype: int64

### Further Exploration/Known Russian Arms Exporters
We now introduce known arms exporter INNs into our analysis. Unfortunately, the list containing these INNs was put together manually by the Labs16 and Labs20 groups, and there is neither a database containing this information nor a way to automatically scrape it into our notebook. The Google spreadsheet containing the running list of known Russian arms exporter INNs can be found here: https://docs.google.com/spreadsheets/d/1-RDS-STLXPQ3tMkPe4hXwt7CL_fM8IrgRWn0uDX6IF4/edit?usp=sharing

As stated in the introduction to this notebook, the goal here is to create a 'profile' for known arms exporters. This is done by totaling the trades per cluster for all known arms exporter INNs. Once totaled, the known exporter totals dataframe will be exported to our S3 bucket and used as a comparison for our product's final analysis.

In [1]:
# This cell saves the string list of known arms exporter inns as variable inn_arms_exp_total
inn_arms_exp_total = ['7718852163',  '7740000090',    '7731084175',  '6161021690',
                      '3807002509',  '6672315362',    '7802375335',  '7813132895',  
                      '7731280660',  '7303026762',    '5040007594',  '2501002394',  
                      '7807343496',  '7731559044',    '5042126251',  '7731595540',    
                      '7733018650',  '7722016820',    '7705654132',  '7714336520',    
                      '7801074335',  '6229031754',    '7830002462',  '6825000757',  
                      '5043000212',  '7802375889',    '5010031470',  '1660249187',  
                      '7720015691',  '6154573235',    '5038087144',  '7713006304',  
                      '7805326230',  '5023002050',    '4007017378',  '7714013456',  
                      '17718852163', '7811406004',    '7702077840',  '7839395419',  
                      '7702244226',  '7704721192',    '7731644035',  '7712040285',
                      '7811144648',  '4345047310',    '7720066255',  '6607000556',
                      '1832090230',  '1835011597',    '3305004083',  '4340000830',
                      '5074051432',  '1841015504',    '7105008338',  '7106002829', 
                      '7704274402',  '5942400228',    '7105514574',  '5012039795', 
                      '7714733528',  '3904065550',    '6825000757',  '7807343496', 
                      '7731559044',  '7805231691',    '7704859803',  '0273008320',
                      '7704274402',  '2902059091',    '7805034277',  '7727692011',
                      '7733759899',  '6154028021',    '7328032711',  '2635002815',
                      '5040097816',  '5027033274',    '5250018433',  '5200000046',
                      '7743813961',  '7718016666',    '5047118550',  '7704274402']

We created a simple `for` loop to check how many times our known INN numbers appeared in the dataset.

In [28]:
# create list of predicted inns of all trades assigned to all clusters
predicted_INNs_total = list(df['CONSIGNOR_INN'])

# create 'for' loop to see how many times known INNs show up in the predicted_INNs_total list
# some INNs are very present in clusters 4 and 7, others not so much
# expanding the list of known arms exporters would be very helpful for this portion
for i in inn_arms_exp_total:
    print("INN:", i, "  " ,"number of occurances in training dataset (all clusters):", predicted_INNs_total.count(i))

INN: 7718852163    number of occurances in training dataset (all clusters): 41343
INN: 7740000090    number of occurances in training dataset (all clusters): 1588
INN: 7731084175    number of occurances in training dataset (all clusters): 1250
INN: 6161021690    number of occurances in training dataset (all clusters): 1143
INN: 3807002509    number of occurances in training dataset (all clusters): 2489
INN: 6672315362    number of occurances in training dataset (all clusters): 579
INN: 7802375335    number of occurances in training dataset (all clusters): 118
INN: 7813132895    number of occurances in training dataset (all clusters): 212
INN: 7731280660    number of occurances in training dataset (all clusters): 492
INN: 7303026762    number of occurances in training dataset (all clusters): 461
INN: 5040007594    number of occurances in training dataset (all clusters): 167
INN: 2501002394    number of occurances in training dataset (all clusters): 243
INN: 7807343496    number of occur
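
Calling `list.count` once per known INN rescans the full list of trades every time. As a hedged alternative, the same counts can be obtained in a single pass with pandas `value_counts`. The sketch below runs on a toy dataframe; the INN values and the names `toy`, `known_inns`, and `known_counts` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the clustered trade dataframe; INNs are made up
toy = pd.DataFrame({"CONSIGNOR_INN": ["111", "222", "111", "333", "111"]})
# Toy stand-in for the list of known arms exporter INNs
known_inns = ["111", "333", "999"]

# Count every INN in one pass over the column
inn_counts = toy["CONSIGNOR_INN"].value_counts()

# Look up only the known INNs; INNs absent from the data get a count of 0
known_counts = inn_counts.reindex(known_inns, fill_value=0)

for inn, n in known_counts.items():
    print("INN:", inn, "  occurrences:", n)
```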

### Final Analysis
We then totaled the trade count per cluster for ALL known Russian arms exporters, transposed it, and saved it to a dataframe. This transposed dataset will be used as the final comparison in our product. For example:

| Cluster% | Known Arms Exporters% | INN in Dataset% |
| --- | --- | --- |
| % of trades falling into cluster0 | 30% | 26% |
| % of trades falling into cluster1 | 12% | 6% |
| % of trades falling into cluster2 | 5% | 15% |
| % of trades falling into cluster3 | 5% | 0% |
| % of trades falling into cluster4 | 5% | 2% |
| % of trades falling into cluster5 | 20% | 18% |
| % of trades falling into cluster6 | 5% | 3% |
| % of trades falling into cluster7 | 1% | 12% |
| % of trades falling into cluster8 | 3% | 0% |
| % of trades falling into cluster9 | 5% | 15% |
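
A table like the one above can be produced from cluster assignments with a cross-tabulation. The following is a minimal sketch on toy data, not the final product's code: the INNs, the labels, and the names `toy`, `counts`, and `ratios` are made up, and the 10-cluster range mirrors the `n_clusters=10` used in this notebook.

```python
import pandas as pd

# Toy stand-in for the clustered trade dataframe; INNs and labels are made up
toy = pd.DataFrame({
    "CONSIGNOR_INN": ["111", "111", "111", "111", "222", "222"],
    "cluster":       [0,     0,     1,     6,     5,     5],
})

# Rows: INN, columns: cluster label, values: number of trades
counts = pd.crosstab(toy["CONSIGNOR_INN"], toy["cluster"])

# Ensure every cluster 0..9 has a column, even if no trade fell into it
counts = counts.reindex(columns=range(10), fill_value=0)

# Divide each row by its total so each row holds the share of trades per cluster
ratios = counts.div(counts.sum(axis=1), axis=0)
print((ratios * 100).round(1))
```

Reindexing the columns before dividing avoids the manual step of adding a zero row for an empty cluster.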

In [34]:
# Create subset of dataframe containing trade corpuses of known arms exporters only
X = df[df['CONSIGNOR_INN'].isin(inn_arms_exp_total)]

# Count known arms exporter trades per cluster and store the counts in a dataframe
listArmcluster = pd.DataFrame(X.cluster.value_counts())
# add 0 value for cluster 9 index, as no known arms exporter trades fell into cluster 9
listArmcluster.loc[9]=0
listArmcluster

Unnamed: 0,cluster
0,55151
1,9961
6,8723
7,1356
8,348
2,81
3,42
5,23
4,13
9,0


In [35]:
# reshape the counts into a single-row profile with one column per cluster (clust0..clust9)
listArmcluster = listArmcluster.sort_index().reset_index()
listArmcluster = listArmcluster.rename(columns={'index': "cluster", 'cluster': "known_AE_trade_count"})
listArmcluster = listArmcluster.T.reset_index().rename(columns={'index':'CONSIGNOR_INN', 0:'clust0', 1:'clust1', 2:'clust2',
                                                    3:'clust3', 4:'clust4', 5:'clust5', 6:'clust6',7:'clust7', 8:'clust8', 9:'clust9'}).rename_axis('', axis=1)
listArmcluster = listArmcluster.drop([0])

# 'profile' for russian arms exporters
listArmcluster

Unnamed: 0,CONSIGNOR_INN,clust0,clust1,clust2,clust3,clust4,clust5,clust6,clust7,clust8,clust9
1,known_AE_trade_count,55151,9961,81,42,13,23,8723,1356,348,0


In [36]:
# export listArmcluster (the known arms exporter profile) to S3 bucket
listArmcluster.to_csv('s3://labs20-arms-bucket/data/armsclustersf.csv')