<a href="https://colab.research.google.com/github/afortuny/SustainableFashionAI/blob/main/CircularityAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Analyzing product reviews to understand circularity

We will use data from https://www.trailrunningreview.com/, a leading company in product analysis, and evaluate each trail running shoe from SS22 along the following dimensions:

Circularity:

*   Durability: Is the product made to last?
*   Versatility: Can the product be used in multiple conditions and situations?
*   Sustainable materials: Is the product made with organic, recyclable or vegan materials?

Desirability:

*   Function: Is the product's construction appropriate for its purpose?
*   Innovation: Is the product disrupting the market in some sense?
*   Price: Is the product affordable?



# Understanding the analytical problem at hand

Our dataset contains long product reviews from which we should be able to extract all the aspects above, with the exception of price, which is already part of the metadata. For price, our plan is simply to cluster products by the similarity of their full reviews and to compute each product's deviation from the average price of its cluster. For the other features we will use unsupervised aspect-based sentiment analysis. To do that, we need to follow these steps:



1.   Use a pretrained model in the language of the corpus. In our case, Spanish.
2.   Detect the aspects in the text and map them to our key dimensions: durability, versatility, sustainability, functionality and innovation.
3.   Cut out the parts of the text related to each aspect.
4.   Perform sentiment analysis on the aspect-related chunks.
5.   Provide a score per aspect based on the intensity of the sentiment.
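
Steps 2 to 5 can be illustrated with a minimal, standard-library sketch. It is not the pipeline used in this notebook: the keyword sets in `ASPECT_KEYWORDS`, the weights in `SENTIMENT_LEXICON` and the sample review are all hypothetical, and a real run would replace them with a pretrained Spanish model.

```python
import re

# Hypothetical aspect keywords and sentiment weights, for illustration only.
ASPECT_KEYWORDS = {
    "durabilidad": {"durabilidad", "resistente"},
    "polivalencia": {"polivalente", "terrenos"},
}
SENTIMENT_LEXICON = {
    "excelente": 1.0,
    "resistente": 0.5,
    "polivalente": 0.5,
    "pobre": -1.0,
}


def aspect_sentiment(review):
    """Return the mean sentiment of the sentences that mention each aspect."""
    # Step 3: cut the text into chunks (here, sentences).
    sentences = re.split(r"(?<=[.!?])\s+", review.strip())
    chunk_scores = {aspect: [] for aspect in ASPECT_KEYWORDS}
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        # Step 4: sentence sentiment is the mean weight of the lexicon hits.
        hits = [SENTIMENT_LEXICON[t] for t in tokens if t in SENTIMENT_LEXICON]
        if not hits:
            continue
        sentence_score = sum(hits) / len(hits)
        # Step 2: assign the sentence to every aspect whose keywords it mentions.
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if keywords.intersection(tokens):
                chunk_scores[aspect].append(sentence_score)
    # Step 5: one score per aspect, averaged over its chunks.
    return {a: sum(s) / len(s) for a, s in chunk_scores.items() if s}


# Hypothetical review text.
sample_review = (
    "La suela es muy resistente y la durabilidad es excelente. "
    "Es polivalente, pero el agarre en barro es pobre. "
    "El precio es alto."
)
print(aspect_sentiment(sample_review))
```

Keyword matching stands in here for aspect detection; the notebook itself maps terms to aspects through vector similarity.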

We will try this workflow on a single review to validate the process before parsing the data at scale and fine-tuning the language model for our domain.
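
For the price dimension, the cluster-relative deviation can be sketched as follows. This assumes cluster labels have already been assigned by some clustering of the reviews, and it uses a relative deviation, (price minus cluster mean) divided by cluster mean, which is one possible reading of "deviation". The product names, labels and prices are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical products: (name, cluster label, price).
products = [
    ("shoe_a", 0, 120.0),
    ("shoe_b", 0, 180.0),
    ("shoe_c", 1, 90.0),
    ("shoe_d", 1, 110.0),
    ("shoe_e", 1, 100.0),
]


def price_deviation(products):
    """Relative deviation of each price from the mean price of its cluster."""
    prices_by_cluster = defaultdict(list)
    for _, cluster, price in products:
        prices_by_cluster[cluster].append(price)
    cluster_mean = {c: mean(p) for c, p in prices_by_cluster.items()}
    return {
        name: (price - cluster_mean[cluster]) / cluster_mean[cluster]
        for name, cluster, price in products
    }


for name, deviation in price_deviation(products).items():
    print(f"{name}: {deviation:+.2f}")
```

A negative value means the product is cheaper than comparable products in its cluster, which could then feed the affordability score.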







# Testing the workflow on a single review

In [3]:
import chardet    
rawdata = open('/content/drive/MyDrive/Sustainability Fashion AI/SampleReview.csv', 'rb').read()
result = chardet.detect(rawdata)
charenc = result['encoding']
print(charenc)

Windows-1252


In [4]:
import pandas as pd
review = pd.read_csv('/content/drive/MyDrive/Sustainability Fashion AI/SampleReview.csv',encoding = 'Windows-1252') 

In [8]:
review_txt = review['Review'].astype(str)

## Detect the list of potential aspects and map them to our key terms based on similarity

In [18]:
!python -m spacy download es_core_news_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting es-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.4.0/es_core_news_sm-3.4.0-py3-none-any.whl (12.9 MB)
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')


In [19]:
import spacy
nlp = spacy.load("es_core_news_sm")

In [39]:
review_p = nlp(review_txt[0])
aspects_p = nlp("durabilidad sostenibilidad polivalencia funcionalidad innovacion")

In [74]:
# Compare every named entity found in the review with every aspect token.
scores = [
    (aspect.text, ent.text, aspect.similarity(ent))
    for ent in review_p.ents
    for aspect in aspects_p
]

df = pd.DataFrame(scores)


In [75]:
df.columns =['aspect', 'term','similarity']
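
The `similarity` values collected here are, by default in spaCy, cosine similarities between vectors. The sketch below shows that computation on plain Python lists. The two 3-dimensional vectors are hypothetical stand-ins for an aspect vector and a term vector. Note that small pipelines such as `es_core_news_sm` do not ship static word vectors, so the similarities they return may be less meaningful than those from a medium or large pipeline.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vectors, for illustration only.
aspect_vec = [1.0, 0.0, 1.0]
term_vec = [1.0, 1.0, 0.0]
print(cosine_similarity(aspect_vec, term_vec))
```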

In [84]:
df = df.drop_duplicates(
  subset = ['aspect', 'term'],
  keep = 'last').reset_index(drop = True)

In [87]:
df_results = df.groupby('aspect').agg({'similarity': ['median', 'min', 'max']})

In [100]:
df_results = pd.DataFrame(df_results)

In [103]:
df_results.columns = ["median","min","max"]

In [106]:
# Avoid shadowing the built-in max() by using a descriptive name.
max_similarity = df['similarity'].max()
df_results['score'] = df_results['max'].div(max_similarity)

In [107]:
df_results['score']

aspect
durabilidad       0.765589
funcionalidad     0.876159
innovacion        1.000000
polivalencia      0.638808
sostenibilidad    0.628852
Name: score, dtype: float64
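
The score is each aspect's best similarity divided by the best similarity over all aspects, so the strongest aspect always gets 1.0. A plain-Python sketch of the same arithmetic follows. The rows are hypothetical, and it assumes the global maximum is positive.

```python
# Hypothetical (aspect, term, similarity) rows, for illustration only.
rows = [
    ("durabilidad", "suela", 0.30),
    ("durabilidad", "upper", 0.40),
    ("innovacion", "placa", 0.50),
    ("innovacion", "suela", 0.10),
]


def aspect_scores(rows):
    """Scale each aspect's maximum similarity by the global maximum."""
    best = {}
    for aspect, _, similarity in rows:
        best[aspect] = max(similarity, best.get(aspect, float("-inf")))
    global_max = max(best.values())  # assumed to be positive
    return {aspect: value / global_max for aspect, value in best.items()}


print(aspect_scores(rows))
```

Because the scaling is relative to a single review, scores from different reviews are not directly comparable unless a shared reference maximum is used.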

In [89]:
df['similarity'].max()

0.5547420978546143

In [77]:
df = df[abs(df['similarity'])>0.05]

In [80]:
df.to_csv('export.csv')

In [None]:
for aspect in aspects_p:
    for token in review_p:
        print(aspect.text, token.text,aspect.similarity(token))


In [None]:
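def spacy_most_similar(word, topn=10):
    # Requires a pipeline that ships word vectors; es_core_news_sm does not.
    query = nlp(word).vector.reshape(1, -1)
    # Vectors.most_similar returns (keys, best_rows, scores), one row per query.
    keys, _, scores = nlp.vocab.vectors.most_similar(query, n=topn)
    words = [nlp.vocab.strings[key] for key in keys[0]]
    return words, scores[0]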

In [36]:
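import numpy as np

def most_similar(word, topn=5):
    # Look up the lexeme for the query word.
    word = nlp.vocab[str(word)]
    # Keep lexemes with the same casing, a high enough log-probability and a non-zero vector.
    # If the pipeline ships no word vectors, nothing passes this filter and the result is [].
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]

    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn+1] if w.lower_ != word.lower_]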

most_similar("dog", topn=3)

[]

In [None]:
doc = nlp(review_txt[0])
aspects = "durabilidad sostenibilidad polivalencia funcionalidad innovacion"
aspects = nlp(aspects)


In [33]:
word = nlp.vocab[str(review_txt[0])]
word.vocab

<spacy.vocab.Vocab at 0x7fe07a0884b0>

In [34]:
import numpy as np

def most_similar(word, topn=5):
    # Use the word passed in, not the full review text, as the query lexeme.
    word = nlp.vocab[str(word)]
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]

    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn+1] if w.lower_ != word.lower_]

most_similar("durabilidad", topn=10)

[]

In [None]:
# Store url
url = 'https://www.trailrunningreview.com/es/Adidas-Terrex-Agravic-Pro/REVIEW--2640.html'

In [None]:
# Import `requests`
import requests

# Make the request and check object type
r = requests.get(url)
type(r)

requests.models.Response

In [None]:
# Extract HTML from Response object and print
html = r.text
print(html)


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="es">
<head>
<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1'/>
<meta charset="UTF-8"/>
<title>Adidas Terrex Agravic Pro - TRAILRUNNINGReview.com</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<meta name="og:title" content="Adidas Terrex Agravic Pro - TRAILRUNNINGReview.com"/>
<meta name="fb:admins" content="649384080"/>
<meta name="Robots" content="ALL,INDEX,FOLLOW"/>
<meta name="googlebot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1"/>
<meta name="Revisit" content="9 days"/>
<meta name="language" content="ES"/>
<meta name="og:locale" content="ES"/>
<meta name="DC.Language" scheme="RFC1766" content="Spanish"/>
<meta name="distribution" content="global"/>
<meta name="copyright" content="TRAIL

In [None]:
# Import BeautifulSoup from bs4
from bs4 import BeautifulSoup


# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")
type(soup)

bs4.BeautifulSoup

In [None]:
soup.find_all('Review')

[]

In [None]:
# find_all matches tag names, so this prints nothing unless the page has <Review> tags.
for link in soup.find_all('Review'):
    print(link.string)

In [None]:

from bs4 import BeautifulSoup

# Download the page again with requests and parse it.
soup = BeautifulSoup(requests.get(url).text, "html5lib")
for link in soup.find_all('Review'):
    print(link.string)

In [None]:
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="es"><head>
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>
<meta charset="utf-8"/>
<title>Adidas Terrex Agravic Pro - TRAILRUNNINGReview.com</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Adidas Terrex Agravic Pro - TRAILRUNNINGReview.com" name="og:title"/>
<meta content="649384080" name="fb:admins"/>
<meta content="ALL,INDEX,FOLLOW" name="Robots"/>
<meta content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="googlebot"/>
<meta content="9 days" name="Revisit"/>
<meta content="ES" name="language"/>
<meta content="ES" name="og:locale"/>
<meta content="Spanish" name="DC.Language" scheme="RFC1766"/>
<meta content="global" name="distribution"/>
<meta content="TRAILRUNNINGReview.com" 

In [None]:
# find_all treats its first argument as a tag name, so a link label returns no matches.
soup.find_all('Ver Review entera')

[]

In [None]:
text = soup.get_text()

In [None]:
text

'\n\n\nAdidas Terrex Agravic Pro - TRAILRUNNINGReview.com\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n@font-face {\n  font-family: \'Source Sans Pro\';\n  font-style: italic;\n  font-weight: 400;\n  font-display: swap;\n  src: url(https://fonts.gstatic.com/s/sourcesanspro/v21/6xK1dSBYKcSV-LCoeQqfX1RYOo3qPa7g.ttf) format(\'truetype\');\n}\n@font-face {\n  font-family: \'Source Sans Pro\';\n  font-style: italic;\n  font-weight: 700;\n  font-display: swap;\n  src: url(https://fonts.gstatic.com/s/sourcesanspro/v21/6xKwdSBYKcSV-LCoeQqfX1RYOo3qPZZclRdr.ttf) format(\'truetype\');\n}\n@font-face {\n  font-family: \'Source Sans Pro\';\n  font-style: italic;\n  font-weight: 900;\n  font-display: swap;\n  src: url(https://fonts.gstatic.com/s/sourcesanspro/v21/6xKwdSBYKcSV-LCoeQqfX1RYOo3qPZZklxdr.ttf) format(\'truetype\');\n}\n@font-face {\n  font-family: \'Source Sans Pro\';\n  font-style: normal;\n  font-weight: 400;\n  font-display: swap;\n  src: url(https://fonts.gstat
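
The `get_text()` output still contains the page's CSS because the contents of `<style>` (and `<script>`) elements are text nodes too. One way around this, sketched here with only the standard library's `html.parser`, is to skip those elements while collecting text. `VisibleTextParser`, `visible_text` and `sample_html` are hypothetical names introduced for this example. With BeautifulSoup the same idea can be applied by calling `decompose()` on the `script` and `style` tags before `get_text()`.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes while skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page fragment, for illustration only.
sample_html = """<html><head><title>Demo</title>
<style>@font-face { font-family: 'X'; }</style></head>
<body><p>Texto de la review.</p><script>var a = 1;</script></body></html>"""
print(visible_text(sample_html))
```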