# Data Analysis

With the data preparation done, we can now start to answer the questions.

## Initialization & Loading

Once again, let's connect to the Spark master container and load the data:

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.types as T
import pyspark.sql.functions as F

In [2]:
spark = SparkSession.builder.appName('analysis').master('spark://spark:7077') \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/09/11 20:26:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
data = spark.read.parquet('/app/files/treated_data')
data.createOrReplaceTempView('data')


As a sanity check, let's see if the data was loaded correctly:

In [4]:
data.show(10)

+----------+--------------------+--------------------+-----------------+--------------------+---+-------------+-----+
|      date|              userId|      conversationId|          channel|   favourite_pokemon|age|         city|botId|
+----------+--------------------+--------------------+-----------------+--------------------+---+-------------+-----+
|2021-03-12|0001a55b006e0bada...|e40467df-6f1f-4c4...|              sms|[fearow, florges,...| 38| campo grande| 1567|
|2021-04-19|00037981db6fa4f24...|31fea68a-a79d-445...|        instagram|           [ninjask]| 23|     brasilia| 1567|
|2021-04-15|0004013912834f551...|b5d4e68c-177d-4fe...|              sms|[carnivine, hippo...| 28|       palmas| 1567|
|2021-04-02|00052c60aafed1bee...|a4387219-be30-412...|         whatsapp|         [electrode]| 55|       suzano| 1567|
|2021-03-10|0007a057c85be431b...|7957d94d-1639-451...|facebook messeger|[barboach, minun,...| 39|    boa vista| 1567|
|2021-04-28|0007ebdf52111b7f1...|de116548-c058-48c...|fa


## Questions

### 1. What is the number of unique conversations per channel?

The answer to this question is given by a straightforward query:

In [5]:
spark.sql('SELECT channel, COUNT(DISTINCT conversationId) AS unique_conversations FROM data GROUP BY 1 ORDER BY 2 DESC').show()

+-----------------+--------------------+
|          channel|unique_conversations|
+-----------------+--------------------+
|         telegram|               10186|
|         whatsapp|               10043|
|        instagram|               10038|
|facebook messeger|                9954|
|              sms|                9775|
+-----------------+--------------------+



Note that these numbers might be *slightly* off because of the 4 interactions that were (deliberately) dropped when we cleaned the data. Also, the `COUNT(DISTINCT ...)` in the query is not strictly necessary for this particular dataset, since every row already has a unique `conversationId`.
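The uniqueness claim can be checked rather than assumed. The sketch below is a hypothetical helper (`has_unique_values` is not part of the original notebook) that compares the total row count with the number of distinct values in a column. It relies only on the DataFrame methods `count`, `select` and `distinct`:

```python
def has_unique_values(df, column):
    """Return True when no value of `column` is repeated across the rows of `df`."""
    total_rows = df.count()
    # Keep only the column of interest, drop duplicates, and count what is left.
    distinct_values = df.select(column).distinct().count()
    return total_rows == distinct_values
```

In this notebook it would be called as `has_unique_values(data, 'conversationId')`. If it returns `True`, `COUNT(conversationId)` and `COUNT(DISTINCT conversationId)` give the same result.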

### 2. Which was the day with the most conversations?

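Grouping by `date` instead of `channel` in the previous query gives the desired answer. We also drop the `DISTINCT`, since every row already has a unique `conversationId`, and save the full result as a CSV file: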

In [6]:
conversations_per_day = spark.sql('SELECT date, COUNT(conversationId) AS conversations FROM data GROUP BY 1 ORDER BY 2 DESC')
conversations_per_day.show(10)
conversations_per_day.coalesce(1).write.mode('overwrite').csv('/app/files/conversations_per_day', header=True)

+----------+-------------+
|      date|conversations|
+----------+-------------+
|2021-04-26|          869|
|2021-03-03|          861|
|2021-03-08|          859|
|2021-03-27|          856|
|2021-04-15|          852|
|2021-03-01|          852|
|2021-04-21|          848|
|2021-04-09|          847|
|2021-04-18|          846|
|2021-04-27|          845|
+----------+-------------+
only showing top 10 rows



### 3. Which city has the most unique users?

Once again, the query is straightforward:

In [7]:
unique_users_per_city = spark.sql('SELECT city, COUNT(DISTINCT userId) AS unique_users FROM data GROUP BY 1 ORDER BY 2 DESC')
unique_users_per_city.show(10)
unique_users_per_city.coalesce(1).write.mode('overwrite').csv('/app/files/unique_users_per_city', header=True)

+------------+------------+
|        city|unique_users|
+------------+------------+
|    joinvile|         524|
|  piracicaba|         523|
|ponta grossa|         518|
|      iguacu|         505|
|    paulista|         504|
|      maceio|         502|
|     barueri|         502|
|      olinda|         499|
|   cariacica|         497|
|     guaruja|         496|
+------------+------------+
only showing top 10 rows



Even though we can read the city with the most unique users directly from the output, we assign the top result to a variable so that later queries can reuse it without hard-coding the city name.

In [8]:
most_unique_users_city = spark.sql('SELECT city, COUNT(DISTINCT userId) AS unique_users FROM data GROUP BY 1 ORDER BY 2 DESC LIMIT 1')
city = most_unique_users_city.collect()[0]['city']

The last three questions are about Pokémon. Because we chose to store each user's list of Pokémon in a single array column, the data is difficult to analyze as it stands.

However, we can [explode](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.explode.html) the data to make it easier to analyze.

In [9]:
pokemon = data.select('userId', F.explode('favourite_pokemon').alias('pokemon'), 'city', 'age')
pokemon.createOrReplaceTempView('pokemon')
pokemon.show(20)

+--------------------+-----------------+-------------+---+
|              userId|          pokemon|         city|age|
+--------------------+-----------------+-------------+---+
|0001a55b006e0bada...|           fearow| campo grande| 38|
|0001a55b006e0bada...|          florges| campo grande| 38|
|0001a55b006e0bada...|        sudowoodo| campo grande| 38|
|00037981db6fa4f24...|          ninjask|     brasilia| 23|
|0004013912834f551...|        carnivine|       palmas| 28|
|0004013912834f551...|       hippopotas|       palmas| 28|
|00052c60aafed1bee...|        electrode|       suzano| 55|
|0007a057c85be431b...|         barboach|    boa vista| 39|
|0007a057c85be431b...|            minun|    boa vista| 39|
|0007a057c85be431b...|   necrozma-ultra|    boa vista| 39|
|0007ebdf52111b7f1...|      minior-blue|foz do iguacu| 30|
|000b29832b77f1815...|  keldeo-resolute|       franca| 55|
|000f08196eb55c0ad...|   rapidash-galar|        serra| 49|
|000f08196eb55c0ad...|         shedinja|        serra| 4

Now the `pokemon` column holds a single value, but the `userId`s repeat. This is by design: each row is an **individual** choice of favourite Pokémon for a given user. This means that if some user *A* has listed 3 Pokémon as favourites (e.g. *X*, *Y*, *Z*), then we will have three different rows (`A-X`, `A-Y` and `A-Z`).
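To make the row multiplication concrete, here is a plain-Python sketch of the same transformation on a tiny, made-up input. The `explode_rows` helper and the sample rows are illustrative assumptions, not part of the notebook. Like Spark's `explode`, it emits nothing for a null or empty array:

```python
def explode_rows(rows, array_column, alias):
    """Yield one output row per element of `array_column`, copying the other fields."""
    for row in rows:
        values = row.get(array_column) or []  # null or empty arrays produce no rows
        for value in values:
            exploded = {k: v for k, v in row.items() if k != array_column}
            exploded[alias] = value
            yield exploded


# Made-up sample: user A lists three favourites, user B lists none.
users = [
    {"userId": "A", "favourite_pokemon": ["X", "Y", "Z"], "city": "c1", "age": 30},
    {"userId": "B", "favourite_pokemon": [], "city": "c2", "age": 41},
]

exploded = list(explode_rows(users, "favourite_pokemon", "pokemon"))
for row in exploded:
    print(row["userId"], row["pokemon"])
```

User *A* yields three rows and user *B* yields none, so a `COUNT(*)` over the exploded view counts (user, Pokémon) pairs rather than users.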

### 4. What are the top 5 Pokémon in the age range of 20-30 years?

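With the above dataset in hand, the following query answers this question. `BETWEEN` is inclusive on both ends, so ages 20 and 30 are both counted. Note that the output shows the top 10 rows, and that fifth and sixth place are tied: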

In [10]:
spark.sql('SELECT pokemon, COUNT(*) as favourite FROM pokemon WHERE age BETWEEN 20 AND 30 GROUP BY 1 ORDER BY 2 DESC').show(10)

+-------------+---------+
|      pokemon|favourite|
+-------------+---------+
|      dewgong|       38|
|venusaur-mega|       38|
|       grotle|       37|
|    electrode|       37|
|      moltres|       36|
| yamask-galar|       36|
|    porygon-z|       35|
|      florges|       35|
|       starly|       35|
|      tympole|       35|
+-------------+---------+
only showing top 10 rows



### 5. List all the favourite Pokémon in the city with the most unique users.

The following query returns the top 10 Pokémon for the city with the most unique users.

In [11]:
spark.sql(f"SELECT pokemon, COUNT(*) AS favourite FROM pokemon WHERE city = '{city}' GROUP BY 1 ORDER BY 2 DESC").show(10)

+--------------------+---------+
|             pokemon|favourite|
+--------------------+---------+
|             blipbug|        5|
|               entei|        5|
|           volcarona|        4|
|           rillaboom|        4|
|              gligar|        4|
|           obstagoon|        4|
|    charizard-mega-x|        4|
|          galvantula|        4|
|             ninjask|        4|
|raticate-totem-alola|        4|
+--------------------+---------+
only showing top 10 rows



Since we have over 500 Pokémon listed for said city (as seen below), we will save all the records in a text file.

In [12]:
spark.sql(f"SELECT COUNT(DISTINCT pokemon) FROM pokemon WHERE city = '{city}'").show()
city_pokemon = spark.sql(f"SELECT DISTINCT pokemon FROM pokemon WHERE city = '{city}' ORDER BY 1 ASC")
city_pokemon.coalesce(1).write.mode('overwrite').text('/app/files/city_with_most_unique_users_pokemon')

+-----------------------+
|count(DISTINCT pokemon)|
+-----------------------+
|                    589|
+-----------------------+



### 6. Considering only the city of São Paulo, list all the favourite Pokémon by age group and the number of times each appears (in the age group). Consider each age group as a group of 10 years (e.g. 10-19, 20-29, 30-39, etc.).

We could create the age groups in SQL, but it is easier to use a Spark UDF (again).

In [13]:
age_groups = {
    (0, 9): '0-9',
    (10, 19): '10-19',
    (20, 29): '20-29',
    (30, 39): '30-39',
    (40, 49): '40-49',
    (50, 59): '50-59',
    (60, 69): '60-69',
    (70, 79): '70-79',
    (80, 89): '80-89',
    (90, 99): '90-99',
    (100, 109): '100-109',
}

In [14]:
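@F.udf
def age_group(age):
    # Return the label of the 10-year bucket containing `age` (None if missing or out of range).
    if age is None:
        return None
    for (low, high), label in age_groups.items():
        if low <= age <= high:
            return label
    return None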
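As an alternative to looking the label up in a dictionary, the bucket can be derived arithmetically from the age. The sketch below is plain Python. `age_group_label` and its `width` parameter are hypothetical names that do not appear elsewhere in this notebook, and the function would still need to be wrapped with `F.udf` before it could be applied to a column.

```python
def age_group_label(age, width=10):
    """Return a label such as '20-29' for the bucket of size `width` containing `age`."""
    if age is None:
        return None  # keep missing ages as missing labels
    lower = (age // width) * width  # round down to the start of the bucket
    return f"{lower}-{lower + width - 1}"


print(age_group_label(51))   # 50-59
print(age_group_label(30))   # 30-39
print(age_group_label(100))  # 100-109
```

This version needs no lookup table and has no upper age limit. It assumes non-negative integer ages.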

In [15]:
sao_paulo_pokemon = pokemon.select('pokemon', 'age', age_group('age').alias('age_group')).where("city = 'sao paulo'")
sao_paulo_pokemon.createOrReplaceTempView('sao_paulo_pokemon')
sao_paulo_pokemon.show()

+-----------------+---+---------+
|          pokemon|age|age_group|
+-----------------+---+---------+
|        terrakion| 51|    50-59|
|      mawile-mega| 51|    50-59|
|         barboach| 51|    50-59|
|     altaria-mega| 34|    30-39|
|pikachu-alola-cap| 30|    30-39|
|         dewpider| 46|    40-49|
|        charizard| 46|    40-49|
|  keldeo-resolute| 46|    40-49|
|          dugtrio| 24|    20-29|
|           baltoy| 41|    40-49|
|       fletchling| 41|    40-49|
|    minior-yellow| 52|    50-59|
|       crabrawler| 52|    50-59|
|           machop| 52|    50-59|
|       aromatisse| 29|    20-29|
|              mew| 52|    50-59|
|       jigglypuff| 52|    50-59|
|           kyurem| 52|    50-59|
|           lairon| 32|    30-39|
|             abra| 32|    30-39|
+-----------------+---+---------+
only showing top 20 rows



With the age groups created, we can answer the question asked:

In [16]:
sao_paulo_pokemon_age_group = spark.sql('SELECT age_group, pokemon, COUNT(*) AS favourite FROM sao_paulo_pokemon GROUP BY 1,2 ORDER BY 1 DESC, 3 DESC')
sao_paulo_pokemon_age_group.show(10)
sao_paulo_pokemon_age_group.coalesce(1).write.mode('overwrite').csv('/app/files/sao_paulo_pokemon_age_group', header=True)

+---------+------------------+---------+
|age_group|           pokemon|favourite|
+---------+------------------+---------+
|    50-59|           poliwag|        2|
|    50-59|           xerneas|        2|
|    50-59|           metapod|        2|
|    50-59|        crabrawler|        2|
|    50-59|            mudkip|        2|
|    50-59|          rolycoly|        1|
|    50-59|minior-blue-meteor|        1|
|    50-59|          sunflora|        1|
|    50-59|             gible|        1|
|    50-59|          blaziken|        1|
+---------+------------------+---------+
only showing top 10 rows
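One caveat about the ordering: `age_group` is a string column, so `ORDER BY 1 DESC` compares the labels as text rather than as numbers. The plain-Python sketch below, with made-up labels and a hypothetical `lower_bound` helper, shows how the two orderings differ once a three-digit group such as `100-109` is present:

```python
def lower_bound(label):
    """Return the numeric lower bound of a label such as '20-29'."""
    return int(label.split("-")[0])


labels = ["20-29", "100-109", "50-59", "30-39"]

as_strings = sorted(labels, reverse=True)                   # text comparison
as_numbers = sorted(labels, key=lower_bound, reverse=True)  # numeric comparison

print(as_strings)  # ['50-59', '30-39', '20-29', '100-109']
print(as_numbers)  # ['100-109', '50-59', '30-39', '20-29']
```

If the text ordering ever becomes a problem, one option is to keep a numeric lower bound alongside the label and sort on that instead.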



Since it is not clear how the data should be presented, we show here only the Pokémon that appear more than once:

In [17]:
spark.sql('SELECT age_group, pokemon, COUNT(*) AS favourite FROM sao_paulo_pokemon GROUP BY 1,2 HAVING favourite > 1 ORDER BY 1 DESC, 3 DESC').show(100)

+---------+--------------------+---------+
|age_group|             pokemon|favourite|
+---------+--------------------+---------+
|    50-59|             poliwag|        2|
|    50-59|             xerneas|        2|
|    50-59|             metapod|        2|
|    50-59|          crabrawler|        2|
|    50-59|              mudkip|        2|
|    40-49|              mothim|        3|
|    40-49|            ludicolo|        2|
|    40-49|               burmy|        2|
|    40-49|             linoone|        2|
|    40-49|            shellder|        2|
|    40-49|           gardevoir|        2|
|    40-49|            dhelmise|        2|
|    40-49|              vulpix|        2|
|    40-49|             hatenna|        2|
|    40-49|       blaziken-mega|        2|
|    30-39|           bounsweet|        3|
|    30-39|            dragalge|        2|
|    30-39|          aromatisse|        2|
|    30-39|            empoleon|        2|
|    30-39|              combee|        2|
|    30-39|

Finally, we stop the Spark session:

In [None]:
spark.stop()