# Notebook Tasks

- Write a cosine similarity function
- Implement [Count Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) from sklearn to create a term-document matrix
- Implement SVD using [`np.linalg.svd`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html)
- Print the first few principal (singular) values of the decomposition
- Transform the decomposition result $V^{T}$ to $V$
- Reduce $V$ to a lower-rank representation by cutting off all principal vectors beyond the first, so that only the first column of $V$ (`V[:, 0]`), the principal axis, is retained

# Imports

In [1]:
# standard imports
import numpy as np
import pandas as pd
import os

# text processing
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import re, string
from sentence_transformers import SentenceTransformer, util

# other useful imports
from importlib import reload
from time import time
from pathlib import Path
from tqdm.notebook import tqdm # progress bar

# plotting
from pylab import plt, mpl

#ignore warnings
import warnings
warnings.filterwarnings(action='ignore')

# File Paths

In [2]:
DATA_PATH = 'data/reports'
SDG_PATH = 'data/sdg.csv'
FILE_NAME_DOC = 'data/doc_df.csv'
FILE_SDG_GROUPED = 'data/df_sdg_doc.csv'
FILE_SDG = 'data/df_sdg.csv'
FILE_NAME_SENT = 'data/sentences.csv'

# Helper Functions

### Cosine Similarity

Write a function named `cosine_similarity` that takes two parameters, `vector_1` and `vector_2`, and returns their cosine similarity as defined by the mathematical formula in **equation number 3**, which is explained in full in the Cosine Similarity section further down in this notebook.

In [3]:
def cosine_similarity(vector_1, vector_2):
    """Compute the cosine similarity between two vectors given as dictionaries of the form {word: tf-idf score}.

    Both dictionaries are expected to list the same words in the same order.
    """
    vec1 = list(vector_1.values())
    vec2 = list(vector_2.values())

    if len(vec1) != len(vec2):
        # fail loudly instead of printing a message and returning None
        raise ValueError("Vectors should have the same length!")

    vec1_dot_vec2 = sum(a * b for a, b in zip(vec1, vec2))
    mag_vec1 = np.sqrt(sum(a**2 for a in vec1))
    mag_vec2 = np.sqrt(sum(b**2 for b in vec2))
    return vec1_dot_vec2 / (mag_vec1 * mag_vec2)

In [4]:
# test cosine similarity
v1 =  {"Amazon" : 0.1,
       "Microsoft" : 0.3,
      "Google" : 0.2,
      "Facebook" : 0,
      "Apple" : 0.4}

v2 =  {"Amazon" : 0,
       "Microsoft" : 0.1,
      "Google" : 0,
      "Facebook" : 5,
      "Apple" : 0}

cosine_similarity(v1, v2)

0.010952260916921357
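
As a quick cross-check, the same value can be reproduced with NumPy. This is a minimal sketch, not part of the original notebook: the name `cosine_similarity_np` is hypothetical, and aligning both dictionaries by word is an added assumption.

```python
import numpy as np

def cosine_similarity_np(vector_1, vector_2):
    """Cosine similarity of two {word: score} dictionaries, aligned by word (hypothetical helper)."""
    words = sorted(vector_1)  # use one fixed word order for both vectors
    if set(words) != set(vector_2):
        raise ValueError("Both vectors must contain the same words")
    a = np.array([vector_1[w] for w in words], dtype=float)
    b = np.array([vector_2[w] for w in words], dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = {"Amazon": 0.1, "Microsoft": 0.3, "Google": 0.2, "Facebook": 0, "Apple": 0.4}
v2 = {"Amazon": 0, "Microsoft": 0.1, "Google": 0, "Facebook": 5, "Apple": 0}
print(round(cosine_similarity_np(v1, v2), 6))  # 0.010952
```

Aligning by key rather than by position avoids silently comparing the wrong components when the two dictionaries were built in a different order.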

### Plotting

In [5]:
def ordered_compare_plot(data):
    # plot the data as a grouped bar chart, one group per SDG goal
    data.plot(x='SDG goal',
              kind='bar',
              edgecolor="black",
              linewidth=0.2,
              figsize=(15, 8))

    plt.ylabel('Score in %')
    plt.xlabel('')
    plt.xticks(rotation=45)
    plt.ylim(0, 0.25)
    plt.legend(loc='upper center',
               bbox_to_anchor=(0.5, -0.13),
               fancybox=True,
               shadow=True,
               ncol=7)

    plt.grid()

In [44]:
doc_df

Unnamed: 0,doc_id,sentence
0,1,Introduction A message from our CEO A message...
1,2,This past year has brought disruption and stre...
2,3,I approach this with a strong point of view. I...
3,4,This Report details the progress of the & Fa...
4,5,Climate Change and Greenhouse Gas Emissions 52...
5,6,"This Environmental, Social and Governance Repo..."
6,7,216 million people could be forced to migrat...
7,8,Current ESG evaluation methodologies are fun...
8,9,"For the best experience, we recommend using t..."
9,10,Cover photo This North Carolina solar facility...


In [42]:
df_sdg_doc

Unnamed: 0,gpname,sentence
0,Economic and Technological Development,"economic growth, full and productive employmen..."
1,Environments,Take urgent action to combat climate change an...
2,Equity,Ensure inclusive and equitable quality educati...
3,Life,End poverty in all its forms everywhere Despit...
4,Resources,Ensure availability and sustainable management...
5,Social Development,"Make cities and human settlements inclusive, s..."


In [43]:
df_sdg

Unnamed: 0,gpnum,gpname,goalnum,sentence
0,gp01,Life,goal01,End poverty in all its forms everywhere
1,gp01,Life,goal01,"Despite progress under the MDGs, approximately..."
2,gp01,Life,goal01,"Over the past decade, markets in developing co..."
3,gp01,Life,goal01,Certain groups are disproportionately represen...
4,gp01,Life,goal01,"These include women, persons with disabilities..."
...,...,...,...,...
636,gp06,Environments,goal15,15.7 Take urgent action to end poaching and tr...
637,gp06,Environments,goal15,"15.8 By 2020, introduce measures to prevent th..."
638,gp06,Environments,goal15,"15.9 By 2020, integrate ecosystems and biodive..."
639,gp06,Environments,goal15,15.a Mobilize and significantly increase from ...


In [45]:
reports_sent

Unnamed: 0,doc_id,file_name,sentence
0,1,United Health Group.pdf,Introduction A message from our CEO A message...
1,1,United Health Group.pdf,"At UnitedHealth Group, we believe a healthy p..."
2,1,United Health Group.pdf,"The more than 350,000 people across Optum and..."
3,1,United Health Group.pdf,Given our reach and resources – and the milli...
4,1,United Health Group.pdf,That’s what makes a health care system sustai...
...,...,...,...
12540,14,Amazon.pdf,All statements other than statements of histor...
12541,14,Amazon.pdf,"We use words such as aim, believe, commit, dr..."
12542,14,Amazon.pdf,Forward-looking statements reflect management’...
12543,14,Amazon.pdf,Actual results could differ materially due to ...


In [6]:
# generate list of companies from path
name_list = [Path(file_name).stem for file_name in os.listdir(DATA_PATH)]

In [7]:
# remove hidden file from list
try:
    name_list.remove('.DS_Store')
except ValueError:
    print("Not on a Mac, can't remove .DS_Store!")
else:
    print("Removed .DS_Store from company list")
    
name_list

Not on a Mac, can't remove .DS_Store!


['United Health Group',
 'JPmorgan',
 'P & G',
 'Johnson-Johnson',
 'NVDIA',
 'Broadcom',
 'Meta',
 'Tesla',
 'Microsoft',
 'Apple',
 'Coca-Cola',
 'Google',
 'Exxon',
 'Amazon']

In [8]:
# document with all text of each PDF file of a company in one row
doc_df = pd.read_csv(FILE_NAME_DOC)

# SDG data with groupby
df_sdg_doc = pd.read_csv(FILE_SDG_GROUPED)

# cleaned SDG data
df_sdg = pd.read_csv(FILE_SDG)

# each sentence in a separate row
reports_sent = pd.read_csv(FILE_NAME_SENT)

In [13]:
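# row positions of the SDG theme documents in the combined data frame, used later for iteration
indices = []
# continue numbering after the last company doc_id
start_number = doc_df['doc_id'].iloc[-1]
df_length = df_sdg_doc.shape[0]

for i in range(df_length):
    start_number += 1
    # zero-based row position of this theme once both data frames are concatenated
    indices.append(start_number-1)
    # replace the theme name with its new numeric doc_id
    df_sdg_doc.replace(df_sdg_doc['gpname'].iloc[i],
                       start_number, inplace=True)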

In [16]:
df_sdg_doc.rename(columns={'gpname': 'doc_id'}, inplace=True)
df_sdg_doc

Unnamed: 0,doc_id,sentence
0,15,"economic growth, full and productive employmen..."
1,16,Take urgent action to combat climate change an...
2,17,Ensure inclusive and equitable quality educati...
3,18,End poverty in all its forms everywhere Despit...
4,19,Ensure availability and sustainable management...
5,20,"Make cities and human settlements inclusive, s..."


In [17]:
# concatenate the two dataframes; ignore_index=True already creates a fresh 0..n-1 index,
# so the additional reset_index(drop=True) leaves the result unchanged
complete_df = pd.concat([doc_df, df_sdg_doc], axis=0, ignore_index=True)
complete_df = complete_df.reset_index(drop=True)
complete_df

Unnamed: 0,doc_id,sentence
0,1,Introduction A message from our CEO A message...
1,2,This past year has brought disruption and stre...
2,3,I approach this with a strong point of view. I...
3,4,This Report details the progress of the & Fa...
4,5,Climate Change and Greenhouse Gas Emissions 52...
5,6,"This Environmental, Social and Governance Repo..."
6,7,216 million people could be forced to migrat...
7,8,Current ESG evaluation methodologies are fun...
8,9,"For the best experience, we recommend using t..."
9,10,Cover photo This North Carolina solar facility...


### Count Vectorizer

Implement [Count Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) from sklearn to create a term document matrix, needed to perform cosine similarity and LSA computations. The following set-up is used:

- Set the `token_pattern` to exclude numbers
- Use `custom_stopwords` as initialized in the cell below
- Set `max_features` to 500, so that only the 500 most frequently occurring words are used
- Fit the `CountVectorizer` with the sentences of our `complete_df` data frame
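
To see what a token pattern that excludes numbers does, the sketch below applies the regular expression `[a-z]+` with Python's `re` module to a lowercased sentence. It only illustrates the pattern: the sample sentence is made up, and `CountVectorizer` itself takes care of lowercasing, stop-word removal, and counting.

```python
import re
from collections import Counter

# made-up sample sentence, not taken from the reports
text = "In 2022 the company's emissions fell by 15 percent."

# keep only runs of lowercase letters: digits and punctuation act as separators
tokens = re.findall(r'[a-z]+', text.lower())
print(tokens)
# ['in', 'the', 'company', 's', 'emissions', 'fell', 'by', 'percent']

# token counts, comparable to one row of a term-document matrix
counts = Counter(tokens)
```

Because the apostrophe splits "company's" into `company` and `s`, a stray `s` token can appear, which is one plausible reason for listing `'s'` among the custom stop words.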

In [75]:
# Text preprocessing and vectorization of the "sentence" column of complete_df

# create custom stopwords to be removed during the preprocessing step, including company names and "exxonmobil", since the company name list only contains "Exxon"
# note: CountVectorizer lowercases the text by default before stop words are applied, so capitalized or multi-word entries from name_list do not match any token as written
custom_stopwords = list(set(stopwords.words('english') + ['s', 'data', 'also', 'exxonmobil'] + name_list))

# the vectorizer converts the text data into a matrix of token counts for only the 500 most frequent words
vect = CountVectorizer(stop_words=custom_stopwords, token_pattern=r'[a-z]+', max_features=500)

# transform the sentence column into a matrix of token counts
vects = vect.fit_transform(complete_df['sentence'])

# convert the sparse matrix (only the non-zero values and their locations are stored)
# into a dense matrix and wrap it in a DataFrame
td = pd.DataFrame(vects.todense())

In [105]:
print(f"term-document matrix for the {td.shape[0]} documents and {td.shape[1]} featured words")
td.head()

term-document matrix for the 20 documents and 500 featured words


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
0,6,68,12,49,10,7,13,7,9,23,...,12,60,5,34,32,3,10,23,17,10
1,6,14,3,29,9,3,6,13,17,9,...,3,31,2,6,13,1,14,19,5,6
2,2,3,4,9,5,3,1,1,3,0,...,1,3,3,1,3,0,12,12,5,2
3,16,81,15,131,31,13,27,24,27,36,...,66,75,48,21,36,6,111,91,55,5
4,7,14,6,28,9,19,26,11,14,14,...,28,66,18,22,38,8,39,50,17,5


In [98]:
# get_feature_names_out() returns the vocabulary of the CountVectorizer as an array of feature names ordered by vocabulary index
words = vect.get_feature_names_out()
print(f"first 10 out of {len(words)} feature words in the vocabulary: {words[:10]}")

first 10 out of 500 feature words in the vocabulary: ['ability' 'access' 'achieve' 'across' 'action' 'actions' 'activities'
 'addition' 'additional' 'address']


In [23]:
# the term-document matrix describes the frequency of words that occur in a corpus (collection of documents)
# set the column names of the dataframe to the feature names (the words in the vocabulary) of the vectorizer
td.columns = vect.get_feature_names_out()
term_document_matrix = td.T
# set the column names of the dataframe to 'doc 1', 'doc 2', etc.
term_document_matrix.columns = [
    'doc ' + str(i) for i in range(1, td.shape[0]+1)]
# add a column with the total count of each word across all documents
term_document_matrix['total_count'] = term_document_matrix.sum(axis=1)

# sort the rows of the dataframe in descending order by the total number of times
# a word appears in all documents and keep only the top 50
term_document_matrix = term_document_matrix.sort_values(
    by='total_count', ascending=False)[:50]

# drop "total_count" from data frame
term_document_matrix = term_document_matrix.drop(columns=['total_count'])
term_document_matrix = term_document_matrix.T
term_document_matrix.head()

Unnamed: 0,health,energy,water,emissions,carbon,employees,products,business,including,global,...,development,reduce,operations,safety,sustainable,sustainability,industry,rights,product,well
doc 1,481,20,29,17,8,43,6,85,56,13,...,19,19,16,4,9,45,16,12,0,33
doc 2,18,21,11,20,14,45,12,50,40,31,...,37,9,9,2,22,15,13,4,2,25
doc 3,6,3,15,5,2,7,9,6,6,4,...,0,2,2,0,1,6,1,0,4,2
doc 4,445,41,34,60,15,166,102,123,94,158,...,72,26,32,69,24,43,48,56,80,64
doc 5,38,67,45,38,13,111,53,72,45,48,...,39,14,18,56,7,11,29,18,42,17
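
The same "keep the most frequent terms" step can be sketched in plain NumPy. The counts and the choice of two terms below are toy assumptions for illustration, not the notebook's data.

```python
import numpy as np

terms = np.array(["health", "energy", "water", "carbon"])
# rows = documents, columns = terms (toy counts)
counts = np.array([[5, 1, 0, 2],
                   [3, 4, 1, 0],
                   [7, 2, 0, 1]])

totals = counts.sum(axis=0)          # total count of each term over all documents
top = np.argsort(totals)[::-1][:2]   # column indices of the 2 most frequent terms
print(terms[top], totals[top])       # ['health' 'energy'] [15  7]

reduced = counts[:, top]             # documents x top-k terms
```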


# Cosine Similarity

The dot product between two vectors measures the agreement between them.
Given two vectors $\vec a$ and $\vec b$, we know that:
<br>
$$
\begin{equation}
\vec{a} \cdot \vec{b}=\|\vec{a}\|\|\vec{b}\| \cos (\theta),
\end{equation}
$$ <br>
where $\|\cdot\|$ implies length of a vector, and $\theta$ is the angle between the two vectors. <br>

For two vectors $\vec{a} = (a_1, a_2, a_3, \ldots)$ and $\vec{b} = (b_1, b_2, b_3, \ldots)$, where $a_{i}$ and $b_{i}$ are the components of the vectors (features of the document, or TF-IDF values for each word of the document) and $n$ is the dimension of the vectors, the dot product is defined as:

$$
\vec{a} \cdot \vec{b}=\sum_{i=1}^{n} a_{i} b_{i}=a_{1} b_{1}+a_{2} b_{2}+\cdots+a_{n} b_{n}
$$
The dot product is thus the sum of the products of the corresponding components of the two vectors $\vec{a}$ and $\vec{b}$. Here is an example of a dot product for two vectors with just two dimensions each:


$$
\begin{equation}
\begin{aligned}
&\vec{a}=(0,5) \\
&\vec{b}=(7,0) \\
&\vec{a} \cdot \vec{b}= 0 * 7 + 5 * 0 =0
\end{aligned}
\end{equation}
$$


Note that the result of a dot product between two vectors is not another vector but a single value, a *scalar*.
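
The worked example can be checked in a few lines. This is a small sketch that computes the dot product once as an explicit sum and once with `np.dot`; the second pair of vectors is an added illustration.

```python
import numpy as np

a = np.array([0, 5])
b = np.array([7, 0])

# sum of the products of corresponding components
manual = sum(a_i * b_i for a_i, b_i in zip(a, b))
print(manual, np.dot(a, b))   # 0 0

# a second pair that is not perpendicular: 3*1 + 4*2 = 11
c = np.array([3, 4])
d = np.array([1, 2])
print(np.dot(c, d))           # 11
print(np.ndim(np.dot(c, d)))  # 0, the result is a scalar, not a vector
```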


The cosine is at its maximum possible value, 1, when the vectors point in the same direction and the angle between them is zero. It becomes progressively smaller as the angle between the vectors increases, until the two vectors are perpendicular to each other and the cosine is zero, implying no similarity: the vectors are orthogonal.

The magnitude of the dot product is also proportional to the lengths of the two vectors. Hence, we do not want to use the dot product per se as a measure of similarity, because two long vectors would then receive a high similarity score even if they are poorly aligned in direction. Rather, we want to use the cosine, defined as:

$$
\begin{equation}
\text{cosine\_similarity}(\vec{a}, \vec{b})=\frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\|\vec{b}\|}
\end{equation}
$$
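
A short sketch of why the normalization matters: scaling both vectors changes the dot product but leaves the cosine untouched. The vectors and the helper name `cosine` are made up for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two NumPy vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])
long_a, long_b = 100 * a, 100 * b   # same directions, much longer vectors

print(np.dot(a, b), np.dot(long_a, long_b))   # 4.0 40000.0
print(np.isclose(cosine(a, b), cosine(long_a, long_b)))   # True

# limiting cases: same direction and perpendicular vectors
same = cosine(a, 3 * a)
perpendicular = cosine(np.array([0.0, 5.0]), np.array([7.0, 0.0]))
```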


Note, the cosine similarity between document vectors is often used to measure similarity between two documents. It is a principled way of measuring the degree of term sharing between the two documents.

However, cosine similarity suffers from a significant drawback: it only measures the direct overlap between terms in documents. This is why we compare our cosine similarity results with LSA, which helps to overcome that issue.
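
The drawback is easy to reproduce with two toy documents that cover the same subject but share no terms. The vocabulary and counts below are invented for illustration.

```python
import numpy as np

vocab = ["gender", "equality", "women", "parity"]
# two toy documents on the same subject with no term in common
doc_a = np.array([2, 1, 0, 0])   # uses "gender" and "equality"
doc_b = np.array([0, 0, 3, 1])   # uses "women" and "parity"

similarity = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(similarity)   # 0.0, although both documents are about the same subject
```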

**Resources:** <br> 
- Chaudhury, Krishnendu [Math and Architectures of Deep Learning](https://www.manning.com/books/math-and-architectures-of-deep-learning), Manning Publications, 2022

# Latent Semantic Analysis

![LSA](LSA_illustration.jpg)
<!-- img src="https://drive.google.com/uc?export=view&id=1YhNUPLCknyBdIii7sJQjx_ghiCZlWZcR" alt="LSA" width="1000" -->

**Figure 1:** Dual interpretation of SVD in terms of the basis vectors of both $D$ and $D^{T}$ <br>
**Image credit:** Aggarwal, Charu [Machine Learning for Text](https://link.springer.com/book/10.1007/978-3-319-73531-3), Springer, 2022.

Matrix factorization and dimensionality reduction belong to the general category of **latent factor models**. The reason we might want to use such models is that sparse and high-dimensional representations, such as those of text documents, work well with some learning methods but not with all of them. Hence the need to compress the data representation so that it can be expressed in a smaller number of features, known as **latent features** because they reflect hidden properties of the data that are not observed in the original data.

Dimensionality reduction is closely related to matrix factorization. Most types of dimensionality reduction transform the data matrices into a factorized form. Thus, the original data matrix $D$ can be approximately represented as a product of two or more matrices, such that the total number of entries in the factorized matrices is far smaller than the number of entries in the original data matrix.

A common way of representing an $n \times d$ document-term matrix $D$ as the product of an $n \times k$ matrix $U$ and a $d \times k$ matrix $V$ is as follows:
$$D \approx UV^{T}$$
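
The saving in stored entries can be sketched with a truncated SVD on a toy matrix. The sizes `n`, `d`, `k`, the random data, and folding the singular values into $U$ are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 50, 3
# toy "document-term" matrix: rank-k structure plus a little noise
D = rng.random((n, k)) @ rng.random((k, d)) + 0.01 * rng.random((n, d))

U_full, S, V_t = np.linalg.svd(D, full_matrices=False)
U = U_full[:, :k] * S[:k]   # n x k, singular values folded into U
V = V_t[:k].T               # d x k

print(D.size, U.size + V.size)   # 1000 210
relative_error = np.linalg.norm(D - U @ V.T) / np.linalg.norm(D)
print(relative_error)            # small, since D is close to rank k
```

Here 210 stored numbers approximate a matrix with 1000 entries, because the toy matrix has only about three underlying factors.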

Latent semantic analysis, non-negative matrix factorization, probabilistic latent semantic analysis, and latent Dirichlet allocation are popular techniques for dimensionality reduction in text. <br><br> 
**Note:** The _text-centric_ avatar for singular value decomposition is latent semantic analysis (LSA).


We already discussed cosine similarity and its drawbacks. For instance, if the words "gender" and "equality" occur together in many other publications, it would seem that they are somewhat related. However, cosine similarity across document vectors ignores such auxiliary data. Words are recognized by the company they keep, and this is precisely the point at which LSA enters the picture.

In other words, terms that often appear together in documents are probably semantically related. Such terms ought to be grouped into a single collection of words that share semantic similarities, which we refer to as a topic. Therefore, document similarity should be assessed in terms of common topics rather than common terms. With that, we broaden the idea of shared terminology between documents to include shared topics. We will look at how to apply this to our data within this milestone.
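
A toy sketch of this idea: "gender" and "equality" co-occur in some documents, so a document that mentions only one of them ends up close to a document that mentions only the other once both are projected onto the leading topics. The matrix and the choice of `k = 2` are assumptions for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two NumPy vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# columns: gender, equality, energy, water (toy counts)
D = np.array([[1, 1, 0, 0],
              [2, 2, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 3, 3],
              [1, 0, 0, 0],    # document that only mentions "gender"
              [0, 1, 0, 0]],   # document that only mentions "equality"
             dtype=float)

U, S, V_t = np.linalg.svd(D, full_matrices=False)
k = 2
docs_in_topic_space = D @ V_t[:k].T   # project documents onto the top-k topics

term_space = cosine(D[4], D[5])
topic_space = cosine(docs_in_topic_space[4], docs_in_topic_space[5])
print(term_space)    # 0.0: no shared terms
print(topic_space)   # close to 1: both documents load on the same topic
```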

**Resources:** <br>
- Aggarwal, Charu [Machine Learning for Text](https://link.springer.com/book/10.1007/978-3-319-73531-3), Springer, 2022. 
- Chaudhury, Krishnendu [Math and Architectures of Deep Learning](https://www.manning.com/books/math-and-architectures-of-deep-learning), Manning Publications, 2022

In [24]:
terms = term_document_matrix.columns.to_list()
doc_term_matrix = term_document_matrix.to_numpy()

In [87]:
print(doc_term_matrix.shape)
print(f"terms: {terms}")
print()
print(f"doc term matrix:\n{doc_term_matrix}")

(20, 50)
terms: ['health', 'energy', 'water', 'emissions', 'carbon', 'employees', 'products', 'business', 'including', 'global', 'new', 'year', 'across', 'work', 'support', 'based', 'company', 'suppliers', 'world', 'climate', 'program', 'supply', 'waste', 'environmental', 'help', 'information', 'management', 'renewable', 'report', 'chain', 'human', 'impact', 'access', 'people', 'communities', 'care', 'million', 'u', 'provide', 'materials', 'development', 'reduce', 'operations', 'safety', 'sustainable', 'sustainability', 'industry', 'rights', 'product', 'well']

doc term matrix:
[[481  20  29  17   8  43   6  85  56  13  21  23  49  60  61  61  33  31
   10  14  58  11  11  45  67  66  37  12  41  13  24  29  68 105  39 315
   35  30  40   3  19  19  16   4   9  45  16  12   0  33]
 [ 18  21  11  20  14  45  12  50  40  31  27  19  29  31  30  25   4  12
   14  13  19   1   3  13  17  41  26   8  35   0   5  17  14   9  39   2
   19  34  25   3  37   9   9   2  22  15  13   4   2  25]
 

### SVD

The document-term matrix is factorized as $USV^{T}$. Each row of $U$ is a vector that represents a document, and each of its elements corresponds to the association of that document with a topic. $S$ contains the singular values, which can be thought of as the "importance" of each topic. The rows of $V^{T}$ represent the topics, and the elements of each row represent the association of that topic with each column (term).

SVD is implemented with [`np.linalg.svd`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html).
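
Before applying it to the real data, the shapes returned by `np.linalg.svd` and the reconstruction $USV^{T}$ can be checked on a small matrix. The 4 x 6 random matrix is a toy stand-in for the document-term matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.integers(0, 10, size=(4, 6)).astype(float)   # toy: 4 documents x 6 terms

U, S, V_t = np.linalg.svd(D)         # full_matrices=True by default
print(U.shape, S.shape, V_t.shape)   # (4, 4) (4,) (6, 6)

# S is a 1D array; only the first len(S) rows of V_t are needed to rebuild D,
# because the remaining rows would be multiplied by zero
D_rebuilt = U @ np.diag(S) @ V_t[:len(S)]
print(np.allclose(D, D_rebuilt))     # True
```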

In [88]:
# perform SVD with numpy
U, S, V_t = np.linalg.svd(doc_term_matrix)

In [107]:
# The matrix U is the left singular vectors, where each column represents a latent topic,
# and the value of each cell represents the strength of the association between the document 
# and the topic
print(U[:5,:5])

[[-2.64231034e-01  6.41023901e-01  5.39379543e-01  1.11693677e-01
  -7.84826321e-03]
 [-9.27571558e-02  3.02394164e-02 -1.24203895e-01 -3.52984173e-02
  -9.34512033e-02]
 [-2.51672891e-02  3.45435583e-03 -6.05068320e-03 -2.04680535e-02
  -2.55057422e-02]
 [-4.29907531e-01  5.19855186e-01 -2.34098324e-01 -1.08323156e-04
   2.73980235e-02]
 [-1.91990676e-01  1.60259163e-02 -2.72321018e-01 -1.45864396e-02
   1.68880067e-01]]


In [109]:
# NumPy returns the diagonal matrix S as a 1D array of its diagonal entries, the singular values,
# which can be thought of as the importance of each topic
S[:5]

array([1354.9682745 ,  742.38304913,  382.83928483,  368.60359269,
        192.78066677])

In [110]:
# The matrix V_t contains the right singular vectors, where each row represents a latent topic
# and each column represents a term/feature; the value of each cell represents the
# strength of the association of the corresponding term with the corresponding topic
V_t[:5,:5]

array([[-0.27832741, -0.28900686, -0.27795487, -0.22940223, -0.24127756],
       [ 0.70864923, -0.27086429, -0.24882796, -0.18183269, -0.28968096],
       [ 0.30440861,  0.08058905,  0.31291376,  0.11398888,  0.34521828],
       [ 0.12422065,  0.45942908, -0.49490479,  0.03579609,  0.01085851],
       [ 0.0521996 ,  0.03390349,  0.16002769,  0.35301797, -0.46298129]])

In [89]:
# Note: NumPy returns S as a 1D array of singular values, which can be converted to a diagonal matrix with S_diag = np.diag(S)
print(f"U: {U.shape}")
print(f"S: {S.shape}")
print(f"V_t: {V_t.shape}")

U: (20, 20)
S: (20,)
V_t: (50, 50)


array([[-0.27832741, -0.28900686, -0.27795487, -0.22940223, -0.24127756,
        -0.16420074, -0.187064  , -0.15525569, -0.15069396, -0.15823892],
       [ 0.70864923, -0.27086429, -0.24882796, -0.18183269, -0.28968096,
         0.11644288, -0.07014911,  0.10030348,  0.03289415,  0.04224678],
       [ 0.30440861,  0.08058905,  0.31291376,  0.11398888,  0.34521828,
        -0.35909136, -0.184477  , -0.11538797, -0.10327968, -0.15448154],
       [ 0.12422065,  0.45942908, -0.49490479,  0.03579609,  0.01085851,
        -0.02459829,  0.29791007, -0.09696354, -0.02930115, -0.19303991],
       [ 0.0521996 ,  0.03390349,  0.16002769,  0.35301797, -0.46298129,
        -0.1247057 , -0.07657394, -0.14155118,  0.00094364,  0.02360342],
       [-0.22042002,  0.15040007,  0.120173  , -0.1767557 , -0.2262694 ,
        -0.08571725,  0.08077109,  0.08492573,  0.09315347, -0.24334507],
       [ 0.19316471,  0.05195226,  0.19656931, -0.44229371, -0.14897302,
        -0.31481674,  0.2703817 , -0.04544636

In [102]:
# the principal values of the SVD are the diagonal entries of the matrix S, also known as the singular values; they are non-negative and in descending order
principal_values = S
print(f"First 10 principal values:\n{principal_values[:10]}")

First 10 principal values:
[1354.9682745   742.38304913  382.83928483  368.60359269  192.78066677
  188.76171887  155.91378889  122.31253504   83.30245421   76.83504487]
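
The two properties of the singular values, non-negative and descending, can be verified directly, along with their relation to the eigenvalues of $DD^{T}$. This sketch uses a random toy matrix rather than the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.integers(0, 10, size=(5, 8)).astype(float)   # toy matrix

S = np.linalg.svd(D, compute_uv=False)        # singular values only
eigvals = np.linalg.eigvalsh(D @ D.T)[::-1]   # eigenvalues of D D^T, descending

non_negative = bool(np.all(S >= 0))
descending = bool(np.all(np.diff(S) <= 0))
print(non_negative, descending)               # True True
print(np.allclose(S**2, eigvals))             # True: squared singular values are the eigenvalues
```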


### Right Singular Vectors

Transform the decomposition result $V^{T}$ to $V$. 

In [99]:
# The columns of V are the topic vectors. Each topic vector can
# be seen as a weighted sum of the terms in the vocabulary.
V = V_t.T
# The right singular vectors of the doc-term matrix
print(f"The number of topics, as linear combinations of features (terms):\n{V.shape}")
# note: V[0] is the first row of V, i.e. the weights of the first term across all topics;
# the first topic vector itself is the first column, V[:, 0]
print(f"The elements in the first row of V indicate how much the first term contributes to each topic:\n{V[0]}")

The number of topics, as linear combinations of features (terms):
(50, 50)
The elements in the first row of V indicate how much the first term contributes to each topic:
[-2.78327406e-01  7.08649232e-01  3.04408610e-01  1.24220652e-01
  5.21996033e-02 -2.20420025e-01  1.93164708e-01 -8.80759482e-02
  2.69321596e-02  1.11145049e-01  1.20189927e-01  8.46716157e-02
  7.18005096e-03 -9.57746592e-02 -9.85620315e-02 -2.78199866e-02
 -7.36265254e-02  1.06589096e-01 -3.59542887e-02  1.49190072e-02
 -5.29204743e-02  6.78720997e-02 -2.17396977e-04  2.62423580e-04
 -3.66778668e-02 -4.42450613e-02 -6.24502367e-02 -3.93729259e-02
 -1.99198284e-02  1.32653854e-02 -2.23612138e-02 -8.10456127e-03
 -1.73708852e-01 -8.48497974e-02 -1.20741726e-02 -2.50477682e-01
  8.10252136e-02 -1.00139803e-02 -3.88096256e-02  1.87166102e-02
  3.17958508e-02 -4.72776480e-02  1.78694147e-02 -5.39972115e-02
  6.87504262e-02  1.58780957e-02 -5.75645231e-04 -1.34635411e-02
 -5.10105463e-02 -5.33655084e-02]
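
Since rows and columns are easy to mix up after the transpose, here is a small sketch of how the rows of $V^{T}$ map to the columns of $V$. It uses a random toy matrix, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.integers(0, 10, size=(4, 5)).astype(float)   # toy: 4 documents x 5 terms
U, S, V_t = np.linalg.svd(D)
V = V_t.T

# topic j is row j of V_t, or equivalently column j of V
print(np.array_equal(V[:, 0], V_t[0]))   # True
# row i of V holds the weights of term i across all topics
print(np.array_equal(V[0], V_t[:, 0]))   # True
# the columns of V are orthonormal
print(np.allclose(V.T @ V, np.eye(5)))   # True
```

So `V[:, 0]` is the first topic vector, while `V[0]` describes how the first term is spread across all topics.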


### Dimensionality Reduction

Reduce $V$ to a lower-rank representation by cutting off all principal vectors beyond the first, so that only the first column of $V$ (`V[:, 0]`), the principal axis, is retained.

In [29]:
# Reduce to a lower-rank representation, since there is a big drop in principal value from S[0] to S[1].
# Cut off all principal vectors beyond the first, retaining only the first column of V, the principal axis.
rank = 1
U = U[:, :rank]
V = V[:, :rank]

In [30]:
V.shape

(50, 1)
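
The notebook picks the rank by eye from the drop between the first two singular values. A common alternative heuristic, which the notebook does not use, is to keep the smallest rank whose squared singular values reach a chosen share of the total. The helper name `rank_for_energy`, the threshold, and the singular values below are all hypothetical.

```python
import numpy as np

def rank_for_energy(singular_values, threshold=0.9):
    """Smallest rank whose squared singular values reach `threshold` of the total (hypothetical helper)."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    rank = int(np.searchsorted(cumulative, threshold) + 1)
    return min(rank, len(energy))   # guard against rounding at threshold=1.0

toy_S = [10.0, 5.0, 2.0, 1.0]          # made-up singular values
print(rank_for_energy(toy_S, 0.75))    # 1
print(rank_for_energy(toy_S, 0.90))    # 2
```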

In [31]:
# weighted contributions of the len(term_topic_affinity) terms to first topic
term_topic_affinity = list(zip(terms, V[:, 0]))

# print the topic affinity
print(term_topic_affinity)

[('health', -0.2783274062170559), ('energy', -0.28900686088977645), ('water', -0.27795487113510586), ('emissions', -0.22940222805847174), ('carbon', -0.24127756480490414), ('employees', -0.16420073713717798), ('products', -0.1870639957942257), ('business', -0.15525569154765015), ('including', -0.15069396143358504), ('global', -0.15823892241796858), ('new', -0.17813154706625928), ('year', -0.1650712339210693), ('across', -0.16419264406156736), ('work', -0.1469819478536872), ('support', -0.13694746393368348), ('based', -0.14511674998335564), ('company', -0.13715348743694758), ('suppliers', -0.14413332542281135), ('world', -0.1391029775171806), ('climate', -0.1265693931932309), ('program', -0.13316931578400032), ('supply', -0.128096222452824), ('waste', -0.1337181437681468), ('environmental', -0.11809557015663767), ('help', -0.11481104195442615), ('information', -0.10589790667751749), ('management', -0.09968359430301366), ('renewable', -0.1274283916594229), ('report', -0.10261586421078907

In [32]:
df_sdg = pd.read_csv(SDG_PATH)

df_sdg_doc = df_sdg.groupby('gpname')['sentence'].agg(' '.join).reset_index()

# save the "themes" as list, for later use
sdg_theme_list = df_sdg_doc['gpname'].tolist()

sdg_theme_list

['Economic and Technological Development',
 'Environments',
 'Equity',
 'Life',
 'Resources',
 'Social Development']

In [70]:
print(len(indices))
print(len(sdg_theme_list))

6
6


In [121]:
def cosine_similarity(vec_1, vec_2):
    """Cosine similarity of two NumPy vectors (replaces the dictionary-based helper above)."""
    vec_1_norm = np.linalg.norm(vec_1)
    vec_2_norm = np.linalg.norm(vec_2)
    return np.dot(vec_1, vec_2) / (vec_1_norm * vec_2_norm)

# create an array of sequential integers from 0 to len(name_list) - 1
# in order to access each company's row of the document-term matrix
com_indices = np.arange(len(name_list))

# loop over names and lists
for name, com_idx in zip(name_list, com_indices):
    print(f'Sustainability Report of {name}')
    print('-' *100)
    
    # loop over themes
    for idx, themes, in zip(indices, sdg_theme_list):
        similarity = cosine_similarity(
            doc_term_matrix[com_idx], doc_term_matrix[idx])
        print(f'Cosine similarity for SDG Theme "{themes}" in original space is {similarity:.8f}')
    print('\n')   
    print('=' *100)

Sustainability Report of United Health Group
----------------------------------------------------------------------------------------------------
Cosine similarity for SDG Theme "Economic and Technological Development" in original space is 0.28188363
Cosine similarity for SDG Theme "Environments" in original space is 0.17524713
Cosine similarity for SDG Theme "Equity" in original space is 0.37261825
Cosine similarity for SDG Theme "Life" in original space is 0.72075345
Cosine similarity for SDG Theme "Resources" in original space is 0.18975085
Cosine similarity for SDG Theme "Social Development" in original space is 0.23407232


Sustainability Report of JPmorgan
----------------------------------------------------------------------------------------------------
Cosine similarity for SDG Theme "Economic and Technological Development" in original space is 0.65787521
Cosine similarity for SDG Theme "Environments" in original space is 0.53307120
Cosine similarity for SDG Theme "Equity" in 

In [122]:
print(f"doc-term-matrix has {doc_term_matrix.shape[0]} rows=documents and {doc_term_matrix.shape[1]} columns=terms")
doc_term_matrix

doc-term-matrix has 20 rows=documents and 50 columns=terms


array([[481,  20,  29,  17,   8,  43,   6,  85,  56,  13,  21,  23,  49,
         60,  61,  61,  33,  31,  10,  14,  58,  11,  11,  45,  67,  66,
         37,  12,  41,  13,  24,  29,  68, 105,  39, 315,  35,  30,  40,
          3,  19,  19,  16,   4,   9,  45,  16,  12,   0,  33],
       [ 18,  21,  11,  20,  14,  45,  12,  50,  40,  31,  27,  19,  29,
         31,  30,  25,   4,  12,  14,  13,  19,   1,   3,  13,  17,  41,
         26,   8,  35,   0,   5,  17,  14,   9,  39,   2,  19,  34,  25,
          3,  37,   9,   9,   2,  22,  15,  13,   4,   2,  25],
       [  6,   3,  15,   5,   2,   7,   9,   6,   6,   4,   3,  12,   9,
          3,   5,   1,   3,   0,  12,   6,   5,   7,   1,   7,  17,   1,
          1,   0,   2,   6,   0,   7,   3,   9,   9,   4,  16,   5,   3,
          0,   0,   2,   2,   0,   1,   6,   1,   0,   4,   2],
       [445,  41,  34,  60,  15, 166, 102, 123,  94, 158, 132,  91, 131,
         75, 105,  80,  89,  55, 111,  32, 132,  59,  17,  51,  87,  73,
     

In [119]:
# transform the original document-term matrix into a new space defined by the topics (latent features):
# the doc_term_matrix rows represent documents and the columns represent terms (the most frequent words);
# V holds the right singular vectors, where each column represents a latent topic and each row
# represents a term/feature; the value of each cell represents the strength of the association of the
# corresponding term with the corresponding topic. Multiplying doc_term_matrix by V
# results in a new matrix whose rows represent documents and whose columns represent topics;
# the values represent the strength of the association of the document with the topic.
# Note: V must be truncated to its first k columns (k < number of terms) for this projection to change
# the similarities; with the full, orthogonal V the cosine similarities equal those in the original space
doc_topic_matrix = np.matmul(doc_term_matrix, V)
# loop over the company names in the order of the corresponding com_indices
for name, com_idx in zip(name_list, com_indices):
    print(f'Sustainability Report of {name}')
    print('-' *100)
    
    # loop over themes
    for idx, themes, in zip(indices, sdg_theme_list):
        # calculate the cosine similarity between the document topic vector of a company's sustainability report
        # and the document topic vector of a specific SDG - sustainable development goal theme
        similarity = cosine_similarity(
            doc_topic_matrix[com_idx], doc_topic_matrix[idx]) # note here we use doc_topic_matrix
        print(f'LSA for SDG Theme document {themes} is {similarity}')


    print('\n')   
    print('=' *100)

    # A higher cosine similarity score indicates company's sustainability report 
    # is more closely related to the sustainable development goal

(20, 50)
Sustainability Report of United Health Group
----------------------------------------------------------------------------------------------------
LSA for SDG Theme document Economic and Technological Development is 0.28188363207221795
LSA for SDG Theme document Environments is 0.17524712956954833
LSA for SDG Theme document Equity is 0.3726182469283598
LSA for SDG Theme document Life is 0.720753452678783
LSA for SDG Theme document Resources is 0.18975085286531088
LSA for SDG Theme document Social Development is 0.23407231791762925


Sustainability Report of JPmorgan
----------------------------------------------------------------------------------------------------
LSA for SDG Theme document Economic and Technological Development is 0.6578752119995852
LSA for SDG Theme document Environments is 0.5330711971032759
LSA for SDG Theme document Equity is 0.6767989121836512
LSA for SDG Theme document Life is 0.567898615688991
LSA for SDG Theme document Resources is 0.4367025200974748
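
The values printed here match the original-space similarities, which is what happens when the documents are multiplied by the full, square $V$: an orthogonal matrix only rotates the space and leaves all angles unchanged. The sketch below illustrates this on a toy matrix and contrasts it with truncated projections; the data and ranks are assumptions for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two NumPy vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(4)
D = rng.integers(1, 20, size=(6, 8)).astype(float)   # toy: 6 documents x 8 terms
U, S, V_t = np.linalg.svd(D)
V = V_t.T

full = D @ V            # all 8 topics: a rotation of the term space
rank2 = D @ V[:, :2]    # two topics
rank1 = D @ V[:, :1]    # one topic: every document becomes a single number

print(cosine(D[0], D[1]), cosine(full[0], full[1]))   # identical up to rounding
print(cosine(rank2[0], rank2[1]))                     # generally different
print(cosine(rank1[0], rank1[1]))                     # always +1 or -1
```

A rank between these extremes is therefore needed for the topic-space similarities to carry information that differs from the original space.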


In [127]:
print(f"doc-topic-matrix has {doc_topic_matrix.shape[0]} rows=documents and {doc_topic_matrix.shape[1]} columns=topics")
doc_topic_matrix

doc-topic-matrix has 20 rows=documents and 50 columns=topics


array([[-3.58024668e+02,  4.75885278e+02,  2.06495678e+02,
         4.11706907e+01, -1.51299341e+00,  7.92149098e+01,
        -2.82156731e+01,  3.65530608e+00,  3.70919924e+00,
        -3.95677014e+00, -1.11720579e+00, -1.05305337e+00,
         2.18365149e-01, -4.63745303e-01,  8.94136497e-02,
        -4.02844707e-01,  3.24844504e-02, -2.40652341e-01,
         8.34387680e-02,  4.05563908e-02, -1.35100264e-14,
         1.40998324e-14,  1.67713066e-14, -8.70137296e-15,
        -1.02695630e-15, -1.90264471e-14, -2.04836148e-14,
        -4.39370762e-14, -1.80411242e-16, -3.02535774e-15,
         3.81639165e-15, -2.41473508e-15, -6.68076705e-14,
        -2.07108636e-14, -1.67643677e-14,  1.36141098e-14,
         2.11670959e-14, -1.10189635e-14, -1.25836841e-14,
         5.25621213e-15,  3.04617442e-15, -3.11348169e-14,
        -1.31492039e-14, -8.54177840e-15,  8.66320904e-15,
         1.47260676e-14, -1.76941795e-15, -3.99680289e-15,
        -1.06581410e-14,  7.10542736e-15],
       [-1.25