# Dataset Exploration & Analysis using Spark

**IMPORTANT**: This is a companion Jupyter notebook used to explore the data and validate results. For the official solution to the Challenge, refer to the notebook [challenge.ipynb](./challenge.ipynb).


## Prerequisites


1. If Spark is not installed, install it by following the instructions on the [official Spark site](https://spark.apache.org/downloads.html).

2. Make sure to update the `SPARK_HOME` and `PATH` environment variables:
    ```bash
    export SPARK_HOME=/path/to/spark
    export PATH=$PATH:$SPARK_HOME/bin
    ```

3. Create a virtual environment (`.venv`) in the root directory and install all the project dependencies.
    ```sh
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    ```
    
4. Download the data

    Manually download https://drive.google.com/file/d/1ig2ngoXFTxP5Pa8muXo02mDTFexZzsis/view?usp=sharing and extract the JSON file from the `.zip` archive.
    Copy the extracted file to the `data/` folder.


**IMPORTANT**: Steps **3** and **4** can be skipped if you already completed them as part of the setup described in the notebook [challenge.ipynb](./challenge.ipynb).
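The prerequisites above can be checked from Python before starting Spark. The following is a minimal sketch, not part of the original setup: the helper name `check_prerequisites` is hypothetical, and it only verifies that `SPARK_HOME` is set, that `$SPARK_HOME/bin` is on `PATH`, and that the data file exists.

```python
import os
from pathlib import Path
from typing import List, Mapping


def check_prerequisites(env: Mapping[str, str], data_file: Path) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper: `env` is usually os.environ, `data_file` the extracted JSON file.
    """
    problems = []

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not Path(spark_home).is_dir():
        problems.append(f"SPARK_HOME does not point to a directory: {spark_home}")
    else:
        # $SPARK_HOME/bin should be on PATH so the Spark launcher scripts resolve
        spark_bin = os.path.normpath(os.path.join(spark_home, "bin"))
        path_entries = {
            os.path.normpath(entry)
            for entry in env.get("PATH", "").split(os.pathsep)
            if entry
        }
        if spark_bin not in path_entries:
            problems.append(f"{spark_bin} is not on PATH")

    if not data_file.is_file():
        problems.append(f"Data file not found: {data_file}")

    return problems
```

From this notebook, a call such as `check_prerequisites(os.environ, Path("../data/farmers-protest-tweets-2021-2-4.json"))` should return an empty list once the steps above are done.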

## Initialization of Notebook and Variables

In [None]:
# Enable the autoreload extension so imported modules are reloaded automatically before each cell runs
%load_ext autoreload
%autoreload 2

In [None]:
file_path = "../data/farmers-protest-tweets-2021-2-4.json"

## Start Spark Session

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("Twitter Data Analysis") \
        .getOrCreate()

## Data Exploration

In [None]:
df = spark.read.json(file_path)

In [None]:
df.printSchema()
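As a Spark-free cross-check of the schema printed above, the top-level keys can be sampled with the standard library. This is a sketch under one assumption: the file holds one JSON object per line, which is the layout `spark.read.json` expects by default. The helper name `sample_top_level_keys` is hypothetical.

```python
import json
from collections import Counter
from itertools import islice


def sample_top_level_keys(path: str, n_lines: int = 100) -> Counter:
    """Count how often each top-level key appears in the first n_lines lines.

    Assumes newline-delimited JSON (one record per line).
    """
    key_counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in islice(fh, n_lines):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            key_counts.update(record.keys())
    return key_counts
```

Keys that appear in fewer sampled records than others are optional fields, which Spark reports as nullable columns.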

In [None]:
df.show()

In [None]:
df.createOrReplaceTempView("tweets")
spark.sql("SELECT * FROM tweets LIMIT 10").show()

In [None]:
df_mentioned = spark.sql("""
            SELECT id, retweetedTweet, content, mentioned.username AS username
            FROM tweets
            LATERAL VIEW explode(mentionedUsers) t AS mentioned
            WHERE mentioned.username = 'meenaharris'
            ORDER BY id
          """
          )

In [None]:
df_mentioned.show(n=10, truncate=False)

## Using Spark and SparkSQL

### Q1: Top 10 Dates with Most Tweets and the User with Most Tweets on Each Date

Approach: Using PySpark

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date
from typing import List, Tuple
import datetime

spark = SparkSession.builder.appName("Twitter Data Analysis").getOrCreate()

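def q1_time(file_path: str) -> List[Tuple[datetime.date, str]]:
    # Parse the raw timestamp into a date once, so that both the grouping and
    # the per-date filter below compare dates with dates instead of relying
    # on an implicit string-to-date cast.
    df = spark.read.json(file_path).withColumn("date", to_date(col("date")))

    tweets_by_date = df.groupBy("date").count()

    # The 10 dates with the most tweets
    top_dates = tweets_by_date.orderBy(col("count").desc()).limit(10).collect()

    results = []
    for row in top_dates:
        date = row["date"]

        # Most active user on this date
        top_user = (
            df.filter(col("date") == date)
            .groupBy("user.username")
            .count()
            .orderBy(col("count").desc())
            .first()
        )
        results.append((date, top_user["username"]))

    return results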

In [None]:
q1_time(file_path)
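Since this notebook is meant for validating results, the output of `q1_time` can be compared against a plain-Python reference on a small sample. The sketch below is not the official solution. The name `q1_reference` is hypothetical, and it assumes each record is a dict whose `date` field is an ISO 8601 string starting with `YYYY-MM-DD` and whose `user.username` field is present.

```python
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def q1_reference(records: Iterable[dict], top_n: int = 10) -> List[Tuple[date, str]]:
    """Top dates by tweet count, each paired with its most active user."""
    tweets_per_date: Counter = Counter()
    users_per_date: Dict[date, Counter] = defaultdict(Counter)

    for record in records:
        # Assumption: the first 10 characters of the timestamp are YYYY-MM-DD
        day = date.fromisoformat(record["date"][:10])
        tweets_per_date[day] += 1
        users_per_date[day][record["user"]["username"]] += 1

    return [
        (day, users_per_date[day].most_common(1)[0][0])
        for day, _ in tweets_per_date.most_common(top_n)
    ]
```

Ties are broken by first appearance here, whereas Spark gives no ordering guarantee among equal counts, so the two results may differ when counts tie.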

### Q2: Top 10 Most Used Emojis with Their Counts

Approach: Using PySpark

In [None]:
import emoji
from typing import List, Tuple
from pyspark.sql.functions import udf, explode, col
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Twitter Data Analysis").getOrCreate()

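def extract_emojis(text):
    # Tweets with null content would make emoji.emoji_list fail, so return no emojis for them
    if text is None:
        return []
    emojis = emoji.emoji_list(text)
    return [e['emoji'] for e in emojis]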

extract_emojis_udf = udf(extract_emojis, ArrayType(StringType()))


def q2_time(file_path: str) -> List[Tuple[str, int]]:
    df = spark.read.json(file_path)

    emojis_df = df.withColumn("emojis", extract_emojis_udf(col("content")))
    emojis_exploded = emojis_df.select(explode(col("emojis")).alias("emoji"))
    emoji_counts = emojis_exploded.groupBy("emoji").count().orderBy(col("count").desc()).limit(10)

    return [(row['emoji'], row['count']) for row in emoji_counts.collect()]

In [None]:
q2_time(file_path)

### Q3: Top 10 Most Influential Historical Users (username) Based on the Number of Mentions (@) Each One Received

Approach: Simple query using SparkSQL

In [None]:
spark.sql("""
            SELECT mentioned.username AS username, COUNT(*) AS mentions_count
            FROM tweets
            LATERAL VIEW explode(mentionedUsers) t AS mentioned
            GROUP BY mentioned.username
            ORDER BY mentions_count DESC
          """
          ).show(n=10, truncate=False)
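The query above can be cross-checked on a small sample with a plain-Python count. This is a sketch for validation only. The name `q3_reference` is hypothetical, and records are assumed to be dicts whose `mentionedUsers` field is either missing, null, or a list of objects with a `username` key.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def q3_reference(records: Iterable[dict], top_n: int = 10) -> List[Tuple[str, int]]:
    """Most mentioned usernames with their mention counts."""
    mention_counts: Counter = Counter()
    for record in records:
        # `or []` mirrors explode(), which produces no rows for a null or empty array
        for mentioned in record.get("mentionedUsers") or []:
            mention_counts[mentioned["username"]] += 1
    return mention_counts.most_common(top_n)
```

Tweets without mentions contribute nothing to either result, because `LATERAL VIEW explode` drops rows whose array is null or empty.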