# What have we done?

- We launched a cluster on AWS.
- We connected to the cluster from a personal PC through the console and SSH.
- We installed JupyterHub and cloned the project [repo](https://github.com/camilaMejia/trabajoFinal), which contains this notebook ready to run. (The GitHub repo includes a file called launch.txt with all the commands that were run.)
- We installed and loaded all the required libraries.
- We copied the .dat and .csv files from S3 to local storage.
- We built the inverted index with metapy.
- We ran queries using BM25.
- We fit an LDA model on all the news articles (content + title only).
- For each article, we identify the dominant topic.

# About the deliverable

- Storage and processing cluster with SparkML/MeTA/NLTK on Amazon AWS: Done (weight 40%)
- Indexing, search and retrieval with MeTA: Done (weight 40%)
- Topic modeling: Done (weight 10%)
- Sentiment analysis: Pending (weight 10%)

Overall, we are at 90% completion.

## Install libraries and add-ons

In [1]:
!pip install pandas
!pip install pyspark
!pip install metapy
!pip install boto3
!pip install nltk
!pip install numpy
!pip install matplotlib
# re and codecs ship with the Python standard library, so they need no pip install.

Collecting pandas
  Using cached https://files.pythonhosted.org/packages/19/74/e50234bc82c553fecdbd566d8650801e3fe2d6d8c8d940638e3d8a7c5522/pandas-0.24.2-cp36-cp36m-manylinux1_x86_64.whl
Collecting numpy>=1.12.0 (from pandas)
  Using cached https://files.pythonhosted.org/packages/87/2d/e4656149cbadd3a8a0369fcd1a9c7d61cc7b87b3903b85389c70c989a696/numpy-1.16.4-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: numpy, pandas
Successfully installed numpy-1.16.4 pandas-0.24.2
Collecting metapy
  Using cached https://files.pythonhosted.org/packages/81/a4/92dae084446597d6bbf355e7eaff3e83dcb51e33d434f43ecdea4c0c4b0a/metapy-0.2.13-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: metapy
Successfully installed metapy-0.2.13
Collecting boto3
  Downloading https://files.pythonhosted.org/packages/a6/1f/b272ead5ccc5370717f3c65ebd5092feab90e748db041bd96c565e7d1a72/boto3-1.9.169-py2.py3-none-any.whl (128kB)

# Load libraries

In [9]:
import io, os, re, codecs, zipfile

import boto3
import metapy
import nltk
import numpy as np
import pandas as pd
import requests

import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession
# The star imports below shadow some Python builtins (e.g. sum, max).
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.mllib.linalg import Vector
from pyspark.mllib.clustering import LDAModel
# DataFrame-based (pyspark.ml) classes; these are the LDA and Vectors used below.
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer
from pyspark.ml.linalg import Vectors, SparseVector
from pyspark.ml.clustering import LDA, BisectingKMeans

from nltk.corpus import stopwords

# Make sure the stopword corpus is available before reading it.
nltk.download('stopwords')
stop_words_nltk = set(stopwords.words('english'))

sc = SparkContext('local', "app-topic-detection")
spark = SparkSession(sc)

# Load required data

In [3]:
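s3 = boto3.client('s3', region_name='us-east-1')

# Make sure the target folder exists before writing the MeTA corpus file.
os.makedirs('data/news', exist_ok=True)
with open('data/news/news.dat', 'wb') as f:
    s3.download_fileobj('finaltext', 'news.dat', f)

# Read the CSV straight from S3 into pandas.
obj = s3.get_object(Bucket='finaltext', Key='news.csv')
df = pd.read_csv(obj['Body'])

# Join title and content with a space so the last word of the title
# does not fuse with the first word of the content; missing values become ''.
df['all'] = df.title.fillna('') + ' ' + df.content.fillna('')
# Keep the first 20,000 rows for the Spark LDA step.
df[['all']].head(20000).to_csv('aux.csv')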

## Inverted index using metapy


In [4]:
#!rm -rf news-idx
idx = metapy.index.make_inverted_index('miniconfig.toml')
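
The contents of `miniconfig.toml` are not shown in this notebook. As a rough sketch, assuming the usual MeTA line-corpus layout (`data/news/news.dat` with a `line.toml` next to it), a config of this shape could be generated as follows. The helper `build_meta_config`, the stop-word file name, and the index folder `news-idx` (borrowed from the commented-out `rm` above) are all assumptions, not the project's actual settings.

```python
import os
import tempfile

def build_meta_config(prefix="data", dataset="news", index="news-idx",
                      stop_words="lemur-stopwords.txt"):
    """Return the text of a minimal MeTA config (every value is an assumption)."""
    return "\n".join([
        'prefix = "{}"'.format(prefix),          # folder that holds the dataset folder
        'dataset = "{}"'.format(dataset),        # expects <prefix>/<dataset>/<dataset>.dat
        'corpus = "line.toml"',                  # corpus description inside the dataset folder
        'index = "{}"'.format(index),            # where the inverted index is written
        'stop-words = "{}"'.format(stop_words),  # hypothetical stop-word file
        '',
        '[[analyzers]]',
        'method = "ngram-word"',
        'ngram = 1',
        'filter = "default-unigram-chain"',
        '',
    ])

config_text = build_meta_config()
# Written to a temporary folder so the real miniconfig.toml is not overwritten.
config_path = os.path.join(tempfile.mkdtemp(), "miniconfig.toml")
with open(config_path, "w") as f:
    f.write(config_text)
```

Passing the resulting path to `metapy.index.make_inverted_index` would only work if the corpus files and the stop-word file actually exist at the configured locations.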


# IR: Queries

In [5]:
ranker = metapy.index.OkapiBM25()
query = metapy.index.Document()
query.content('Trump hates china') # query from AP news
top_docs = ranker.score(idx, query, num_results=5)

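# top_docs is a list of (doc_id, score) pairs; doc ids are assumed to match the row order of df.
index = [tup[0] for tup in top_docs]
df.loc[index, ['title', 'content']]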


| | title | content |
|---|---|---|
| 23373 | Ann Coulter: I Might Have Been Killed at the B... | Ann Coulter, in a major interview with Vanity ... |
| 55967 | A Trump Supporter Dwells in Beijing | BEIJING — Ardent Chinese supporters of Donald... |
| 123632 | For Chinese officials, Trump perhaps better th... | In 2010, then Secretary of State Hillary Clin... |
| 119243 | Massachusetts college apologizes for racist tw... | Salem State University President Patricia Mese... |
| 25871 | Donald Trump’s Hypocrisies - Breitbart | Part of Donald Trump’s appeal as a candidate i... |
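
To make the ranking step less of a black box, here is a minimal pure-Python sketch of a textbook BM25 scorer. It is not metapy's implementation: the function `bm25_scores`, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are illustrative assumptions, and metapy's exact formula variant may differ.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query` with a textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * ((1.0 - b) + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + norm)
        scores.append(score)
    return scores

# Toy corpus (made up for illustration).
docs = [
    "trump visits china for trade talks".split(),
    "local team wins the championship".split(),
    "china responds to trump tariffs".split(),
]
query = "trump china".split()
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Documents that share no term with the query score zero, and between two documents with the same term counts the shorter one ranks higher because of the length normalization.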


# LDA on spark

### Preprocess data

Here we load the data into Spark and preprocess the text.

In [None]:
df=spark.read.csv('aux.csv', inferSchema=True, header=True)


nltk.download('punkt')
nltk.download('stopwords')


# Stopwords from NLTK



rawdata = spark.read.load("aux.csv", format="csv", header=True)


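def cleanup_text(record):
    # record is a struct of all columns: record[0] is the row id, record[1] the text.
    uid   = record[0]
    text  = record[1] or ''   # guard against null rows, which would break .split()
    words = text.split()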
    
    # Default list of Stopwords
    stopwords_core = ['a', u'about', u'above', u'after', u'again', u'against', u'all', u'am', u'an', u'and', u'any', u'are', u'arent', u'as', u'at', 
    u'be', u'because', u'been', u'before', u'being', u'below', u'between', u'both', u'but', u'by', 
    u'can', 'cant', 'come', u'could', 'couldnt', 
    u'd', u'did', u'didn', u'do', u'does', u'doesnt', u'doing', u'dont', u'down', u'during', 
    u'each', 
    u'few', 'finally', u'for', u'from', u'further', 
    u'had', u'hadnt', u'has', u'hasnt', u'have', u'havent', u'having', u'he', u'her', u'here', u'hers', u'herself', u'him', u'himself', u'his', u'how', 
    u'i', u'if', u'in', u'into', u'is', u'isnt', u'it', u'its', u'itself', 
    u'just', 
    u'll', 
    u'm', u'me', u'might', u'more', u'most', u'must', u'my', u'myself', 
    u'no', u'nor', u'not', u'now', 
    u'o', u'of', u'off', u'on', u'once', u'only', u'or', u'other', u'our', u'ours', u'ourselves', u'out', u'over', u'own', 
    u'r', u're', 
    u's', 'said', u'same', u'she', u'should', u'shouldnt', u'so', u'some', u'such', 
    u't', u'than', u'that', 'thats', u'the', u'their', u'theirs', u'them', u'themselves', u'then', u'there', u'these', u'they', u'this', u'those', u'through', u'to', u'too', 
    u'under', u'until', u'up', 
    u'very', 
    u'was', u'wasnt', u'we', u'were', u'werent', u'what', u'when', u'where', u'which', u'while', u'who', u'whom', u'why', u'will', u'with', u'wont', u'would', 
    u'y', u'you', u'your', u'yours', u'yourself', u'yourselves']
    
    # Custom List of Stopwords - Add your own here
    stopwords_custom = ['']
    stopwords = stopwords_core + stopwords_custom
    stopwords = [word.lower() for word in stopwords]    
    
    text_out = [re.sub('[^a-zA-Z0-9]','',word) for word in words]                                       # Remove special characters
    text_out = [word.lower() for word in text_out if len(word)>2 and word.lower() not in stopwords]     # Remove stopwords and words under X length
    return text_out

udf_cleantext = udf(cleanup_text , ArrayType(StringType()))
clean_text = rawdata.withColumn("words", udf_cleantext(struct([rawdata[x] for x in rawdata.columns])))
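
The cleaning logic can be tried outside Spark on a single string. The sketch below mirrors the same three steps (strip non-alphanumeric characters, lowercase, drop short tokens and stopwords); the function name `clean_tokens`, the tiny stopword set, and the sample sentence are illustrative assumptions.

```python
import re

# Tiny illustrative subset; the notebook uses a much longer stopword list.
STOPWORDS = {"the", "and", "for", "was", "with"}

def clean_tokens(text, stopwords=STOPWORDS, min_len=3):
    """Tokenize on whitespace, strip punctuation, lowercase, and filter tokens."""
    # Remove every non-alphanumeric character from each whitespace-separated token.
    stripped = [re.sub(r"[^a-zA-Z0-9]", "", w) for w in text.split()]
    # Lowercase, then drop tokens shorter than min_len and stopwords.
    return [w.lower() for w in stripped
            if len(w) >= min_len and w.lower() not in stopwords]

tokens = clean_tokens("Kim Jong-un says the U.S. was 'ready' for talks!")
```

Note that stripping punctuation fuses hyphenated names into one token (`Jong-un` becomes `jongun`) and that short abbreviations such as `U.S.` are dropped by the length filter.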

### Embeddings + LDA

Here we build TF-IDF features for each row and then fit an LDA model with k topics.

In [79]:
# Term Frequency Vectorization  - Option 2 (CountVectorizer)    : 
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures", vocabSize = 1000)
cvmodel = cv.fit(clean_text)
featurizedData = cvmodel.transform(clean_text)

vocab = cvmodel.vocabulary
vocab_broadcast = sc.broadcast(vocab)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

# Fit LDA with k=5 data-driven topics:
lda = LDA(k=5, seed=123, optimizer="em", featuresCol="features")

ldamodel = lda.fit(rescaledData)

#model.isDistributed()
#model.vocabSize()

ldatopics = ldamodel.describeTopics()
#ldatopics.show(25)

def map_termID_to_Word(termIndices):
    words = []
    for termID in termIndices:
        words.append(vocab_broadcast.value[termID])
    
    return words

udf_map_termID_to_Word = udf(map_termID_to_Word , ArrayType(StringType()))
ldatopics_mapped = ldatopics.withColumn("topic_desc", udf_map_termID_to_Word(ldatopics.termIndices))

[nltk_data] Downloading package punkt to /home/hadoop/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/hadoop/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
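
As a rough illustration of what the `CountVectorizer` + `IDF` stages compute, here is a pure-Python sketch on a toy corpus. It uses the smoothed formula `log((m + 1) / (df + 1))` that Spark documents for `IDF`, but the helper `tfidf` and the toy documents are assumptions, and the vocabulary is sorted alphabetically here rather than by term frequency as `CountVectorizer` does.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, vectors) with raw term counts weighted by a smoothed IDF."""
    vocab = sorted({t for d in docs for t in d})
    m = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((m + 1) / (df[t] + 1)) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

# Toy corpus (made up for illustration).
docs = [
    ["trump", "china", "trade"],
    ["trump", "senate"],
    ["trump", "china"],
]
vocab, vectors = tfidf(docs)
```

A term that appears in every document (here `trump`) gets an IDF of zero, so it contributes nothing to the features that LDA sees.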


### Show topics

In [None]:
ldatopics_mapped.select(ldatopics_mapped.topic, ldatopics_mapped.topic_desc).show(50,False)

### Add detected topic to each line
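
The dominant topic of an article is the index of the largest entry in its `topicDistribution` vector. A minimal sketch of that step, with a hypothetical helper `dominant_topic` and made-up probabilities:

```python
def dominant_topic(distribution):
    """Return the index of the most probable topic in a topic distribution."""
    return max(range(len(distribution)), key=lambda i: distribution[i])

# Made-up distribution over k=5 topics for a single article.
example = [0.106, 0.090, 0.500, 0.200, 0.104]
best = dominant_topic(example)
```

In Spark, the same idea could be wrapped in a UDF over the `topicDistribution` column, for example by converting each vector with `toArray()` and taking `numpy.argmax`; that wiring is an assumption and is not part of the cell below.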

In [54]:
ldaResults = ldamodel.transform(rescaledData)

ldaResults.select('all','words','features','topicDistribution').show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|               title|             content|               words|            features|   topicDistribution|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|House Republicans...|WASHINGTON  —   C...|[house, republica...|(1000,[0,1,2,20,2...|[0.10604047711236...|
|Rift Between Offi...|After the bullet ...|[rift, officers, ...|(1000,[0,1,2,31,3...|[0.09065335868619...|
|Tyrus Wong, ‘Bamb...|When Walt Disney’...|[tyrus, wong, bam...|(1000,[0,1,2,11,7...|[0.08040092480352...|
|Among Deaths in 2...|Death may be the ...|[among, deaths, 2...|(1000,[0,1,2,208,...|[0.06378516439272...|
|Kim Jong-un Says ...|SEOUL, South Kore...|[kim, jongun, say...|(1000,[0,1,2,7,21...|[0.08504651510476...|
|Sick With a Cold,...|LONDON  —   Queen...|[sick, cold, quee...|(1000,[0,1,2,81,6...|[0.22573492592097...|
|Taiwan’s Presiden...|BEIJING  —   Pr