# Explanation

The goal here is to identify the most frequent expressions, where an expression is a sequence of consecutive words.

To do this, we first split each sentence into words (breaking on spaces) and then pass the result to Spark's `NGram` class.

`NGram` then performs the desired transformation. For example, suppose you have the following tokenized sentence and use N = 3:

`[this, is, the, most, cool]`

`NGram` outputs an array of phrases formed from consecutive words, in this case:

`[this is the, is the most, the most cool]`

This is exactly the desired result.
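To make the sliding-window behaviour concrete, here is a minimal plain-Python sketch of the same transformation. The `ngrams` helper is hypothetical and only mimics the output shape of Spark's `NGram`; it is not how Spark implements it.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    # A window of size n can start at indices 0 .. len(words) - n.
    # If there are fewer than n words, the range is empty and so is the result.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = ["this", "is", "the", "most", "cool"]
trigrams = ngrams(words, 3)
print(trigrams)  # ['this is the', 'is the most', 'the most cool']
```

A sentence with fewer than N words produces no n-grams at all, which matches the behaviour documented for `NGram`.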

Finally, we apply `explode` to turn the elements of each list into individual rows, and then a `groupBy` to aggregate and count the most common phrases.
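The explode-then-count step can be pictured with a small plain-Python analogy. The sample `phrase_lists` data is made up for illustration, and `Counter` stands in for Spark's `groupBy(...).count()`; this is a sketch of the idea, not the Spark execution model.

```python
from collections import Counter
from itertools import chain

# One list of phrases per review (illustrative data, not from the dataset).
phrase_lists = [
    ["to the top", "the top of"],
    ["go to the", "to the top"],
]

# "explode": flatten the per-review lists into one phrase per row.
exploded = list(chain.from_iterable(phrase_lists))

# "groupBy + count": tally how often each phrase occurs.
counts = Counter(exploded)

# Equivalent of ordering by count, descending, and taking the first row.
top = counts.most_common(1)
print(top)  # [('to the top', 2)]
```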

Documentation link: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.ml.feature.NGram.html

In [1]:
# Spark session, the column helpers used below, and the n-gram transformer
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, desc, lower
from pyspark.ml.feature import NGram

# Number of words per expression (the n in n-gram)
EXPRESSIONS_SIZE = 3


spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

# Load the reviews and lowercase the text so that counting is case-insensitive
df = spark.read.json('file:///home/ec2-user/eiffel-tower-reviews.json').select(lower('text').alias('text'))
df.show()

+--------------------+
|                text|
+--------------------+
|this is the most ...|
|my significant ot...|
|we had a tour to ...|
|visited with my w...|
|we went in the ni...|
|dont hesitate and...|
|i enjoyed the tow...|
|read through the ...|
|this by far was o...|
|something you hav...|
|the views are bea...|
|worth spending a ...|
|took the tour to ...|
|a fantastic fusio...|
|whatever you do i...|
|not to miss..beau...|
|we visited in the...|
|go for sunset and...|
|we booked weeks a...|
|eiffel tower is j...|
+--------------------+
only showing top 20 rows



First, we split the input strings on spaces, turning each string into an array of words.

In [2]:
# Split each review on single spaces and drop rows where the result is null
df = df.select(split(df.text,' ').alias('words')).na.drop()
df.show()

+--------------------+
|               words|
+--------------------+
|[this, is, the, m...|
|[my, significant,...|
|[we, had, a, tour...|
|[visited, with, m...|
|[we, went, in, th...|
|[dont, hesitate, ...|
|[i, enjoyed, the,...|
|[read, through, t...|
|[this, by, far, w...|
|[something, you, ...|
|[the, views, are,...|
|[worth, spending,...|
|[took, the, tour,...|
|[a, fantastic, fu...|
|[whatever, you, d...|
|[not, to, miss..b...|
|[we, visited, in,...|
|[go, for, sunset,...|
|[we, booked, week...|
|[eiffel, tower, i...|
+--------------------+
only showing top 20 rows
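One detail worth keeping in mind: splitting on a single space produces an empty token wherever two spaces appear in a row, and such tokens would then end up inside the n-grams. The plain-Python sketch below illustrates the difference between splitting on one space and splitting on runs of whitespace. The sample string is made up, and whether the dataset actually contains repeated spaces is an assumption, not something the notebook shows.

```python
import re

sample = "we  went up"  # note the two spaces after "we"

# Splitting on a single space keeps an empty token between the two spaces.
single_space_tokens = sample.split(" ")

# Splitting on one or more whitespace characters avoids the empty token.
whitespace_tokens = re.split(r"\s+", sample)

print(single_space_tokens)  # ['we', '', 'went', 'up']
print(whitespace_tokens)    # ['we', 'went', 'up']
```

Since the pattern argument of Spark's `split` is a regular expression, a pattern such as `\s+` could be used there as well.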



Next, we apply Spark's `NGram` transformer to obtain the phrases (here we chose n = 3 as the expression size).

In [3]:
# Build 3-word phrases from the "words" column and store them in "phrases"
ngram_gen = NGram(n=EXPRESSIONS_SIZE, inputCol="words", outputCol="phrases")
processed_df = ngram_gen.transform(df)

processed_df.show()

+--------------------+--------------------+
|               words|             phrases|
+--------------------+--------------------+
|[this, is, the, m...|[this is the, is ...|
|[my, significant,...|[my significant o...|
|[we, had, a, tour...|[we had a, had a ...|
|[visited, with, m...|[visited with my,...|
|[we, went, in, th...|[we went in, went...|
|[dont, hesitate, ...|[dont hesitate an...|
|[i, enjoyed, the,...|[i enjoyed the, e...|
|[read, through, t...|[read through the...|
|[this, by, far, w...|[this by far, by ...|
|[something, you, ...|[something you ha...|
|[the, views, are,...|[the views are, v...|
|[worth, spending,...|[worth spending a...|
|[took, the, tour,...|[took the tour, t...|
|[a, fantastic, fu...|[a fantastic fusi...|
|[whatever, you, d...|[whatever you do,...|
|[not, to, miss..b...|[not to miss..bea...|
|[we, visited, in,...|[we visited in, v...|
|[go, for, sunset,...|[go for sunset, f...|
|[we, booked, week...|[we booked weeks,...|
|[eiffel, tower, i...|[eiffel to

Next, we apply `explode` to obtain the list of phrases (expressions) present in the original text, one per row.

In [4]:
common_phrases_df = processed_df.select(explode(processed_df.phrases).alias("expressions"))
common_phrases_df.show()

+--------------------+
|         expressions|
+--------------------+
|         this is the|
|         is the most|
|    the most busiest|
|most busiest attt...|
|busiest atttactio...|
| atttaction in paris|
|        in paris and|
|     paris and there|
|       and there are|
|      there are some|
|       are some nice|
|some nice restaur...|
| nice restaurants on|
|   restaurants on it|
|           on it and|
|          it and the|
|       and the views|
|      the views were|
|views were specta...|
|were spectacular and|
+--------------------+
only showing top 20 rows



Finally, with the phrases ready, all that remains is to aggregate the data and count the most frequent phrases.

In [5]:
# Count occurrences of each expression and show the most frequent first
phrases_frequency = common_phrases_df.groupBy("expressions").count()
phrases_frequency.orderBy(desc("count")).show()

+-----------------+-----+
|      expressions|count|
+-----------------+-----+
| the eiffel tower| 1607|
|       to the top|  795|
|        go to the|  490|
|  eiffel tower is|  488|
|the eiffel tower.|  441|
|        up to the|  419|
|         to go up|  366|
|      to the top.|  362|
|       to see the|  352|
|     of the tower|  352|
|       the top of|  352|
|         to go to|  349|
|       one of the|  337|
|       top of the|  302|
|    to the second|  300|
|    of the eiffel|  280|
|      you have to|  268|
|     from the top|  266|
|    to the eiffel|  248|
|     the tower is|  245|
+-----------------+-----+
only showing top 20 rows
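The counts above list `the eiffel tower` and `the eiffel tower.` as separate expressions, because splitting on spaces leaves punctuation attached to the words. One possible refinement, which is not part of the original notebook, is to strip punctuation before tokenizing. The sketch below shows the idea in plain Python. `normalize` is a hypothetical helper, and its character class assumes ASCII text, so accented letters would be dropped too.

```python
import re


def normalize(text):
    """Lowercase the text, replace punctuation with spaces, and split into words."""
    # Anything that is not a lowercase ASCII letter, a digit, or whitespace
    # becomes a space, so "tower." and "tower" yield the same token.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # str.split() with no argument collapses runs of whitespace
    # and never returns empty tokens.
    return cleaned.split()


tokens = normalize("We went to the Eiffel Tower.")
print(tokens)  # ['we', 'went', 'to', 'the', 'eiffel', 'tower']
```

In Spark, `regexp_replace` from `pyspark.sql.functions` could apply a similar pattern to the `text` column before `split`, which would merge the punctuated and unpunctuated variants into a single count.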

