# Loading Natural Language Text
## Chapter 2

A natural-language analysis of a Sherlock Holmes book.

In [1]:
try:
    !pip install pyspark=="2.4.5"  --quiet
except:
    print("Running through a .py file.")

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import lower, col
import pandas as pd
import os

# Creating the Spark session

In [3]:
spark = SparkSession\
        .builder\
        .appName("Analise Sherlock homes - Fabio Kfouri")\
        .getOrCreate()

In [4]:
spark

In [5]:
#read book
df = spark.read.text("sherlock.txt")

Reading the first line of the book

In [6]:
print(df.first())

Row(value="Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle")


Number of lines

In [7]:
print(df.count())

12309


Viewing an excerpt of the book. Setting truncate to False lets longer lines be displayed in full.

In [8]:
df.show(15, truncate = False)

+----------------------------------------------------------------------------+
|value                                                                       |
+----------------------------------------------------------------------------+
|Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle|
|                                                                            |
|This eBook is for the use of anyone anywhere at no cost and with            |
|almost no restrictions whatsoever.  You may copy it, give it away or        |
|re-use it under the terms of the Project Gutenberg License included         |
|with this eBook or online at www.gutenberg.net                              |
|                                                                            |
|                                                                            |
|Title: The Adventures of Sherlock Holmes                                    |
|                                                   

Converting the text to lowercase and defining an alias

In [9]:
df = df.select(lower(col('value')).alias('value'))
df.show(15, truncate = False)

+----------------------------------------------------------------------------+
|value                                                                       |
+----------------------------------------------------------------------------+
|project gutenberg's the adventures of sherlock holmes, by arthur conan doyle|
|                                                                            |
|this ebook is for the use of anyone anywhere at no cost and with            |
|almost no restrictions whatsoever.  you may copy it, give it away or        |
|re-use it under the terms of the project gutenberg license included         |
|with this ebook or online at www.gutenberg.net                              |
|                                                                            |
|                                                                            |
|title: the adventures of sherlock holmes                                    |
|                                                   

Replacing text and terms

In [10]:
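# Note: the text was lowercased above, so the case-sensitive pattern 'Mr\.' matches nothing here; r'mr\.' would match.
df = df.select(F.regexp_replace('value', r'Mr\.', 'Mr').alias('value'))
df = df.select(F.regexp_replace('value', "don't", 'do not').alias('value'))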
#df = df.select(F.regexp_replace('value', '\'s', 'do not').alias('value'))
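
Because the text is lowercased before the replacements run, a case-sensitive pattern only matches if it is written in lowercase too. The plain-Python sketch below uses `re.sub` as a stand-in for `regexp_replace`; the helper name `normalize` and the sample sentence are made up for illustration.

```python
import re

def normalize(line: str) -> str:
    """Hypothetical helper: lowercase a line, then expand two abbreviations."""
    line = line.lower()
    line = re.sub(r"mr\.", "mr", line)       # lowercase pattern, so it matches
    line = re.sub(r"don't", "do not", line)
    return line

sample = "Mr. Holmes, I don't know."         # made-up sample sentence
print(normalize(sample))

# A capitalized pattern applied to lowercased text leaves it unchanged.
unchanged = re.sub(r"Mr\.", "Mr", sample.lower())
print(unchanged)
```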

Tokenizing the text: split returns an array of strings

In [11]:
df1 = df.select(F.split('value', '[ ]').alias('words'))
df1.show(truncate=False)

+---------------------------------------------------------------------------------------------+
|words                                                                                        |
+---------------------------------------------------------------------------------------------+
|[project, gutenberg's, the, adventures, of, sherlock, holmes,, by, arthur, conan, doyle]     |
|[]                                                                                           |
|[this, ebook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with]              |
|[almost, no, restrictions, whatsoever., , you, may, copy, it,, give, it, away, or]           |
|[re-use, it, under, the, terms, of, the, project, gutenberg, license, included]              |
|[with, this, ebook, or, online, at, www.gutenberg.net]                                       |
|[]                                                                                           |
|[]                                     

Split the text and remove unwanted symbols, such as punctuation.

In [75]:
punctuation = "_|.\?\!\",\'\[\]\*():;<>”“’"
df2 = df.select(F.split('value', '[ %s]' % punctuation).alias('words'))
df2.show(truncate=False)

+---------------------------------------------------------------------------------------------------+
|words                                                                                              |
+---------------------------------------------------------------------------------------------------+
|[project, gutenberg, s, the, adventures, of, sherlock, holmes, , by, arthur, conan, doyle]         |
|[]                                                                                                 |
|[this, ebook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with]                    |
|[almost, no, restrictions, whatsoever, , , you, may, copy, it, , give, it, away, or]               |
|[re-use, it, under, the, terms, of, the, project, gutenberg, license, included]                    |
|[with, this, ebook, or, online, at, www, gutenberg, net]                                           |
|[]                                                                               
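
Splitting on a character class treats every single separator as a boundary, so two separators in a row (for example a period followed by a space) leave an empty string between them. That is where the blank entries in the output above come from. A minimal plain-Python sketch with `re.split`, assuming a shortened separator set rather than the full `punctuation` string:

```python
import re

# Assumption: a shortened separator set; the notebook's string has more symbols.
separators = r"[ _.?!,():;]"

line = "almost no restrictions whatsoever.  you may copy it, give it away or"
tokens = re.split(separators, line)
print(tokens)

# Adjacent separators (". " + " " and ", ") produce empty tokens.
print(tokens.count(""))
```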

Exploding the words array column so that each word gets its own row, preserving the order

In [76]:
df3 = df2.select(F.explode('words').alias('word'))
df3.show()

+----------+
|      word|
+----------+
|   project|
| gutenberg|
|         s|
|       the|
|adventures|
|        of|
|  sherlock|
|    holmes|
|          |
|        by|
|    arthur|
|     conan|
|     doyle|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
+----------+
only showing top 20 rows



Comparing the sizes of the DataFrames

In [77]:
print(df.count(), df3.count())

12309 132720


Removing empty rows (blank words)

In [78]:
noblank_df = df3.where(F.length('word') > 0)
print(df3.count(), noblank_df.count())

132720 108322
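
The explode and filter steps can be pictured in plain Python as flattening a list of lists and then dropping the empty strings. The sample rows below are made up; each inner list stands for one row of the words column.

```python
# Made-up sample: each inner list stands for one row of the `words` array column.
rows = [["project", "gutenberg", "s", "the"], [""], ["holmes", "", "by", "arthur"]]

# "Explode": flatten to one word per row, keeping the original order.
exploded = [word for words in rows for word in words]

# Drop empty strings, mirroring where(F.length('word') > 0).
non_blank = [word for word in exploded if len(word) > 0]

print(len(exploded), len(non_blank))
```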


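Adding an ID to the dataset with the function monotonically_increasing_id(), which efficiently creates a column of increasing integers. The IDs are guaranteed to be increasing and unique, but not necessarily consecutive; they are consecutive here because the data sits in a single partition.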

In [79]:
df4 = noblank_df.select('word', F.monotonically_increasing_id().alias('id'))
df4.show()

+----------+---+
|      word| id|
+----------+---+
|   project|  0|
| gutenberg|  1|
|         s|  2|
|       the|  3|
|adventures|  4|
|        of|  5|
|  sherlock|  6|
|    holmes|  7|
|        by|  8|
|    arthur|  9|
|     conan| 10|
|     doyle| 11|
|      this| 12|
|     ebook| 13|
|        is| 14|
|       for| 15|
|       the| 16|
|       use| 17|
|        of| 18|
|    anyone| 19|
+----------+---+
only showing top 20 rows
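
Spark's documentation describes the current implementation of monotonically_increasing_id() as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below mirrors that layout in plain Python; `make_id` is a hypothetical helper, not a Spark function.

```python
def make_id(partition_id: int, row_in_partition: int) -> int:
    """Hypothetical helper mirroring the documented bit layout:
    partition ID in the upper 31 bits, record number in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition

# With a single partition (partition 0) the IDs are simply 0, 1, 2, ...
first_partition = [make_id(0, i) for i in range(3)]
print(first_partition)

# In a second partition the IDs jump: still increasing and unique, but not consecutive.
second_partition = [make_id(1, i) for i in range(3)]
print(second_partition)
```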



### Partitioning the data
Partitioning lets Spark parallelize operations.

This example uses the when function together with the withColumn operation.

In [80]:
df5 = df4.withColumn('title', F.when(df4.id < 25000, 'Preface')
                             .when(df4.id < 50000, 'Chapter 1')
                             .when(df4.id < 75000, 'Chapter 2')
                             .otherwise('Chapter 3'))

In [81]:
df5.show()

+----------+---+-------+
|      word| id|  title|
+----------+---+-------+
|   project|  0|Preface|
| gutenberg|  1|Preface|
|         s|  2|Preface|
|       the|  3|Preface|
|adventures|  4|Preface|
|        of|  5|Preface|
|  sherlock|  6|Preface|
|    holmes|  7|Preface|
|        by|  8|Preface|
|    arthur|  9|Preface|
|     conan| 10|Preface|
|     doyle| 11|Preface|
|      this| 12|Preface|
|     ebook| 13|Preface|
|        is| 14|Preface|
|       for| 15|Preface|
|       the| 16|Preface|
|       use| 17|Preface|
|        of| 18|Preface|
|    anyone| 19|Preface|
+----------+---+-------+
only showing top 20 rows
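
A chained when/otherwise expression behaves like an if/elif/else ladder: conditions are checked in order and the first match wins. A plain-Python sketch of the same thresholds, where `title_for` is a hypothetical name used only here:

```python
def title_for(word_id: int) -> str:
    """Plain-Python equivalent of the chained when/otherwise: first matching branch wins."""
    if word_id < 25000:
        return "Preface"
    elif word_id < 50000:
        return "Chapter 1"
    elif word_id < 75000:
        return "Chapter 2"
    return "Chapter 3"

# Check the values on both sides of each threshold.
for word_id in (0, 24999, 25000, 74999, 75000):
    print(word_id, title_for(word_id))
```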



Adding a new column named part

In [82]:
df5 = df5.withColumn('part', F.when(df5.id < 25000, 0)
                             .when(df5.id < 50000, 1)
                             .when(df5.id < 75000, 2)
                             .otherwise(3))

In [83]:
df5.show()

+----------+---+-------+----+
|      word| id|  title|part|
+----------+---+-------+----+
|   project|  0|Preface|   0|
| gutenberg|  1|Preface|   0|
|         s|  2|Preface|   0|
|       the|  3|Preface|   0|
|adventures|  4|Preface|   0|
|        of|  5|Preface|   0|
|  sherlock|  6|Preface|   0|
|    holmes|  7|Preface|   0|
|        by|  8|Preface|   0|
|    arthur|  9|Preface|   0|
|     conan| 10|Preface|   0|
|     doyle| 11|Preface|   0|
|      this| 12|Preface|   0|
|     ebook| 13|Preface|   0|
|        is| 14|Preface|   0|
|       for| 15|Preface|   0|
|       the| 16|Preface|   0|
|       use| 17|Preface|   0|
|        of| 18|Preface|   0|
|    anyone| 19|Preface|   0|
+----------+---+-------+----+
only showing top 20 rows



Repartition the data in df5, creating a new DataFrame, df6, that is partitioned by a column.

<i>"Put rows with the same value of the 'part' column in the same partition."</i>

In [84]:
df6 = df5.repartition(12, 'part')
print(df5.rdd.getNumPartitions(), df6.rdd.getNumPartitions())

1 12
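
Repartitioning by a column assigns each row to a partition based on a hash of the column value, so equal values always land together. With only four distinct values of part, at most four of the twelve partitions can receive rows. The sketch below uses Python's built-in `hash` modulo the partition count as a stand-in; Spark's actual hash function is different, so the real partition numbers will not match.

```python
from collections import defaultdict

NUM_PARTITIONS = 12

def partition_for(key: int) -> int:
    # Stand-in for Spark's hash partitioning; the real hash function differs.
    return hash(key) % NUM_PARTITIONS

# Made-up rows: (id, part)
rows = [(0, 0), (1, 0), (25000, 1), (50000, 2), (50001, 2), (75000, 3)]

partitions = defaultdict(list)
for row_id, part in rows:
    partitions[partition_for(part)].append(row_id)

print(dict(partitions))
non_empty = len(partitions)
print(non_empty, "of", NUM_PARTITIONS, "partitions are non-empty")
```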


In [85]:
df6.show()

+----------+---+-------+----+
|      word| id|  title|part|
+----------+---+-------+----+
|   project|  0|Preface|   0|
| gutenberg|  1|Preface|   0|
|         s|  2|Preface|   0|
|       the|  3|Preface|   0|
|adventures|  4|Preface|   0|
|        of|  5|Preface|   0|
|  sherlock|  6|Preface|   0|
|    holmes|  7|Preface|   0|
|        by|  8|Preface|   0|
|    arthur|  9|Preface|   0|
|     conan| 10|Preface|   0|
|     doyle| 11|Preface|   0|
|      this| 12|Preface|   0|
|     ebook| 13|Preface|   0|
|        is| 14|Preface|   0|
|       for| 15|Preface|   0|
|       the| 16|Preface|   0|
|       use| 17|Preface|   0|
|        of| 18|Preface|   0|
|    anyone| 19|Preface|   0|
+----------+---+-------+----+
only showing top 20 rows



In [86]:
#df6.coalesce(1).write.csv('spark_output/df6')

## Moving window analysis
A technique that analyzes a set of adjacent rows together (here, tuples of 3).

In [87]:
df6.createOrReplaceTempView("temp")

In [88]:
query = """
        SELECT id, word AS w1,
               LEAD(word, 1) OVER(PARTITION BY part ORDER BY id) as w2,
               LEAD(word, 2) OVER(PARTITION BY part ORDER BY id) as w3
        FROM temp
"""

spark.sql(query).sort('id').show()

+---+----------+----------+----------+
| id|        w1|        w2|        w3|
+---+----------+----------+----------+
|  0|   project| gutenberg|         s|
|  1| gutenberg|         s|       the|
|  2|         s|       the|adventures|
|  3|       the|adventures|        of|
|  4|adventures|        of|  sherlock|
|  5|        of|  sherlock|    holmes|
|  6|  sherlock|    holmes|        by|
|  7|    holmes|        by|    arthur|
|  8|        by|    arthur|     conan|
|  9|    arthur|     conan|     doyle|
| 10|     conan|     doyle|      this|
| 11|     doyle|      this|     ebook|
| 12|      this|     ebook|        is|
| 13|     ebook|        is|       for|
| 14|        is|       for|       the|
| 15|       for|       the|       use|
| 16|       the|       use|        of|
| 17|       use|        of|    anyone|
| 18|        of|    anyone|  anywhere|
| 19|    anyone|  anywhere|        at|
+---+----------+----------+----------+
only showing top 20 rows
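
LEAD(word, n) returns the value n rows ahead in the window's ordering, or null when there is no such row. A minimal plain-Python sketch of that behavior on a single partition, where the `lead` helper is hypothetical:

```python
def lead(values, offset):
    """Value `offset` rows ahead of each position, or None past the end (like SQL LEAD)."""
    n = len(values)
    return [values[i + offset] if i + offset < n else None for i in range(n)]

words = ["project", "gutenberg", "s", "the", "adventures"]  # first words of the book
trigrams = list(zip(words, lead(words, 1), lead(words, 2)))
for w1, w2, w3 in trigrams:
    print(w1, w2, w3)
```

The last two rows contain None because the window runs off the end of the list.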



Using LAG instead, we look at the preceding rows

In [89]:
query = """
        SELECT id, 
               LAG(word, 2) OVER(PARTITION BY part ORDER BY id) as w1,
               LAG(word, 1) OVER(PARTITION BY part ORDER BY id) as w2,
               word AS w3
        FROM temp
"""

spark.sql(query).sort('id').show()

+---+----------+----------+----------+
| id|        w1|        w2|        w3|
+---+----------+----------+----------+
|  0|      null|      null|   project|
|  1|      null|   project| gutenberg|
|  2|   project| gutenberg|         s|
|  3| gutenberg|         s|       the|
|  4|         s|       the|adventures|
|  5|       the|adventures|        of|
|  6|adventures|        of|  sherlock|
|  7|        of|  sherlock|    holmes|
|  8|  sherlock|    holmes|        by|
|  9|    holmes|        by|    arthur|
| 10|        by|    arthur|     conan|
| 11|    arthur|     conan|     doyle|
| 12|     conan|     doyle|      this|
| 13|     doyle|      this|     ebook|
| 14|      this|     ebook|        is|
| 15|     ebook|        is|       for|
| 16|        is|       for|       the|
| 17|       for|       the|       use|
| 18|       the|       use|        of|
| 19|       use|        of|    anyone|
+---+----------+----------+----------+
only showing top 20 rows



In [90]:
query = """
        SELECT id, 
               LAG(word, 2) OVER(PARTITION BY part ORDER BY id) as w1,
               LAG(word, 1) OVER(PARTITION BY part ORDER BY id) as w2,
               word AS w3
        FROM temp
        WHERE part = 2
"""

spark.sql(query).sort('id').show()

+-----+-------+-------+-------+
|   id|     w1|     w2|     w3|
+-----+-------+-------+-------+
|50000|   null|   null|   case|
|50001|   null|   case|against|
|50002|   case|against|    you|
|50003|against|    you|      i|
|50004|    you|      i|     do|
|50005|      i|     do|    not|
|50006|     do|    not|   know|
|50007|    not|   know|   that|
|50008|   know|   that|  there|
|50009|   that|  there|     is|
|50010|  there|     is|    any|
|50011|     is|    any| reason|
|50012|    any| reason|   that|
|50013| reason|   that|    the|
|50014|   that|    the|details|
|50015|    the|details| should|
|50016|details| should|   find|
|50017| should|   find|  their|
|50018|   find|  their|    way|
|50019|  their|    way|   into|
+-----+-------+-------+-------+
only showing top 20 rows
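
The nulls at id 50000 and 50001 appear because the window is defined with PARTITION BY part: the first rows of a part have no predecessors inside their own partition, even though earlier words exist in the previous part. A plain-Python sketch of LAG applied per partition, with made-up rows and a hypothetical `lag` helper:

```python
from itertools import groupby

def lag(values, offset):
    """Value `offset` rows behind each position, or None before the start (like SQL LAG)."""
    return [values[i - offset] if i - offset >= 0 else None for i in range(len(values))]

# Made-up rows (part, word), already sorted by part and id.
rows = [(1, "the"), (1, "whole"), (2, "case"), (2, "against"), (2, "you")]

result = []
for part, group in groupby(rows, key=lambda r: r[0]):
    words = [word for _, word in group]
    # The window restarts for every part, so lag never crosses a boundary.
    result.extend(zip(lag(words, 2), lag(words, 1), words))

for trigram in result:
    print(trigram)
```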



In [91]:
query = """
SELECT
part,
LAG(word, 2) OVER(PARTITION BY part ORDER BY id) AS w1,
LAG(word, 1) OVER(PARTITION BY part ORDER BY id) AS w2,
word AS w3,
LEAD(word, 1) OVER(PARTITION BY part ORDER BY id) AS w4,
LEAD(word, 2) OVER(PARTITION BY part ORDER BY id) AS w5
FROM temp
"""
spark.sql(query).where("part = 3").show(10)

+----+---------+---------+---------+------------+------------+
|part|       w1|       w2|       w3|          w4|          w5|
+----+---------+---------+---------+------------+------------+
|   3|        i|      had|       so|   foolishly|    rejected|
|   3|      had|       so|foolishly|    rejected|       ‘come|
|   3|       so|foolishly| rejected|       ‘come|        come|
|   3|foolishly| rejected|    ‘come|        come|         she|
|   3| rejected|    ‘come|     come|         she|       cried|
|   3|    ‘come|     come|      she|       cried|breathlessly|
|   3|     come|      she|    cried|breathlessly|       ‘they|
+----+---------+---------+---------+------------+------------+
only showing top 10 rows



Common word sequences.

How to identify the most frequent word sequences in a natural-language text document.

In [92]:
query3agg = """
select w1, w2, w3, count(*) as count FROM (
        SELECT id, word AS w1,
               LEAD(word, 1) OVER(PARTITION BY part ORDER BY id) as w2,
               LEAD(word, 2) OVER(PARTITION BY part ORDER BY id) as w3
        FROM temp
        )
group by w1, w2, w3
order by count desc
"""

spark.sql(query3agg).show()

+-----+-----+-----+-----+
|   w1|   w2|   w3|count|
+-----+-----+-----+-----+
|  one|   of|  the|   48|
|    i|think| that|   45|
|   it|   is|    a|   45|
|   it|  was|    a|   45|
| that|   it|  was|   38|
|  out|   of|  the|   35|
| that|    i| have|   35|
| that|   it|   is|   34|
|    i|   do|  not|   34|
|there|  was|    a|   34|
| that|   he|  had|   30|
| that|   he|  was|   30|
| that|    i|  was|   28|
| lord|   st|simon|   28|
| that|    i|  had|   27|
|   in|front|   of|   27|
|    i| have|   no|   27|
|think| that|    i|   26|
|   to|   be|    a|   25|
|    i|could|  not|   24|
+-----+-----+-----+-----+
only showing top 20 rows
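
The same idea, counting every window of three consecutive words and sorting by frequency, can be sketched in plain Python with `collections.Counter`. The word stream is made up, and unlike the SQL query this version drops the incomplete windows at the end instead of padding them with nulls.

```python
from collections import Counter

# Made-up word stream for illustration.
words = "it was a dark night and it was a long night and it was late".split()

# Slide a window of three words and count each distinct trigram.
trigrams = Counter(zip(words, words[1:], words[2:]))

for (w1, w2, w3), count in trigrams.most_common(3):
    print(w1, w2, w3, count)
```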



Another analysis: finding the longest tuples

In [97]:
query3agg = """
select w1, w2, w3, length(w1) + length(w2) + length(w3) as length FROM (
        SELECT id, word AS w1,
               LEAD(word, 1) OVER(PARTITION BY part ORDER BY id) as w2,
               LEAD(word, 2) OVER(PARTITION BY part ORDER BY id) as w3
        FROM temp
        )
group by w1, w2, w3
order by length desc
"""

spark.sql(query3agg).show(truncate = False)

+-------------------+-------------------+-------------------+------+
|w1                 |w2                 |w3                 |length|
+-------------------+-------------------+-------------------+------+
|intellectual       |property           |trademark/copyright|39    |
|comfortable-looking|building           |two-storied        |38    |
|widespread         |comfortable-looking|building           |37    |
|interesting        |character—dummy    |bell-ropes         |36    |
|property           |trademark/copyright|agreement          |36    |
|probability—the    |strong             |probability—is     |35    |
|extraordinary      |circumstances      |connected          |35    |
|simple-minded      |nonconformist      |clergyman          |35    |
|particularly       |malignant          |boot-slitting      |34    |
|especially         |commercial         |redistribution     |34    |
|oppressively       |respectable        |frock-coat         |33    |
|unsystematic       |sensational  

Finding the most frequent sequence in each chapter

In [110]:
query3agg = """
SELECT chapter, w1, w2, w3, count FROM
(
  SELECT
  chapter,
  ROW_NUMBER() OVER (PARTITION BY chapter ORDER BY count DESC) AS row,
  w1, w2, w3, count
  FROM ( --
  
          select chapter, w1, w2, w3, count(*) as count FROM (
                SELECT id, title as chapter, word AS w1,
                       LEAD(word, 1) OVER(PARTITION BY part ORDER BY id) as w2,
                       LEAD(word, 2) OVER(PARTITION BY part ORDER BY id) as w3
                FROM temp
                )
            group by chapter, w1, w2, w3
            order by count desc --
          )
    )
WHERE row = 1
ORDER BY chapter ASC

"""

spark.sql(query3agg).show()

+---------+----+---+-----+-----+
|  chapter|  w1| w2|   w3|count|
+---------+----+---+-----+-----+
|Chapter 1|that| he|  was|   16|
|Chapter 2|  it|was|    a|   15|
|Chapter 3|lord| st|simon|   28|
|  Preface|  it| is|    a|   15|
+---------+----+---+-----+-----+
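
Picking the top sequence per chapter amounts to ranking the counted trigrams within each chapter by count, highest first, and keeping only the first row. A plain-Python sketch of that selection follows; the rows are a made-up sample standing in for the aggregated result, and the lower counts are invented.

```python
# Made-up (chapter, w1, w2, w3, count) rows standing in for the aggregated result.
counts = [
    ("Chapter 1", "that", "he", "was", 16),
    ("Chapter 1", "it", "is", "a", 9),
    ("Chapter 2", "one", "of", "the", 12),
    ("Chapter 2", "it", "was", "a", 15),
]

# Keep, for each chapter, the row with the highest count.
best = {}
for row in counts:
    chapter, count = row[0], row[-1]
    if chapter not in best or count > best[chapter][-1]:
        best[chapter] = row

for chapter in sorted(best):
    print(best[chapter])
```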



## Caching

In [116]:
df6.is_cached  # a property, not a method: calling df6.is_cached() raises TypeError: 'bool' object is not callable

False

In [113]:
type(df6)

pyspark.sql.dataframe.DataFrame

In [117]:
df.storageLevel.useMemory

False

In [119]:
df.rdd.getStorageLevel().useMemory  # getStorageLevel is a method, so it must be called first

False