### 01 Develop Analysis Workflow

This notebook develops and tests the Spark workflow before it is moved into dedicated Python files for running on the cluster.

In [23]:
from pathlib import Path
import pandas as pd

Although the intention here is to practice Spark and distributed computing, let's first look at the data in pandas and use it as a sanity check for the Spark DataFrames.

In [24]:
# Limit how many rows to import for speed
nrows = 100_000

In [25]:
questions = pd.read_csv(Path('./assets/Questions2.csv'), nrows=nrows,
                encoding="ISO-8859-1").dropna(subset=["Id", "Body", "CreationDate"])
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            100000 non-null  int64  
 1   OwnerUserId   95342 non-null   float64
 2   CreationDate  100000 non-null  object 
 3   ClosedDate    3331 non-null    object 
 4   Score         100000 non-null  int64  
 5   Title         100000 non-null  object 
 6   Body          100000 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 5.3+ MB


In [26]:
questions.sort_values(by="Id").head(5)

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
3,180,2089740.0,2008-08-01T18:42:19Z,,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
4,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...


In [27]:
answers = pd.read_csv(Path('./assets/Answers.csv'), nrows=nrows,
                encoding="ISO-8859-1").dropna()
answers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96675 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            96675 non-null  int64  
 1   OwnerUserId   96675 non-null  float64
 2   CreationDate  96675 non-null  object 
 3   ParentId      96675 non-null  int64  
 4   Score         96675 non-null  int64  
 5   Body          96675 non-null  object 
dtypes: float64(1), int64(3), object(2)
memory usage: 5.2+ MB


In [28]:
answers.head(5)

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
0,92,61.0,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Vers..."
1,124,26.0,2008-08-01T16:09:47Z,80,12,<p>I wound up using this. It is a kind of a ha...
2,199,50.0,2008-08-01T19:36:46Z,180,1,<p>I've read somewhere the human eye can't dis...
3,269,91.0,2008-08-01T23:49:57Z,260,4,"<p>Yes, I thought about that, but I soon figur..."
4,307,49.0,2008-08-02T01:49:46Z,260,28,"<p><a href=""http://www.codeproject.com/Article..."


In [29]:
tags = pd.read_csv(Path('./assets/Tags.csv'),
                encoding="ISO-8859-1").dropna()
tags.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3749881 entries, 0 to 3750993
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Tag     object
dtypes: int64(1), object(1)
memory usage: 85.8+ MB


In [30]:
tags.head()

Unnamed: 0,Id,Tag
0,80,flex
1,80,actionscript-3
2,80,air
3,90,svn
4,90,tortoisesvn


### PySpark: Preprocess

Now let's jump into Spark. We'll start by reading in the CSV files. Note that we set a row limit here to keep computation light; the limit will be removed when we're ready to run everything on the cluster.

In [31]:
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, LongType
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StackOverflow").getOrCreate()

In [32]:
df_questions = (spark.read.options(encoding="ISO-8859-1",
                header=True, multiLine=False, mode="DROPMALFORMED")
                .csv("./assets/Questions.csv")
                .limit(100_000)
                ).cache()

df_answers = (spark.read.options(encoding="ISO-8859-1", 
                header=True, mode="DROPMALFORMED", multiLine=False)
                .csv('./assets/Answers.csv')
                .limit(100_000)
                ).cache()

df_tags = (spark.read.options(encoding="ISO-8859-1", 
            header=True, mode="DROPMALFORMED", multiLine=False)
            .csv('./assets/Tags.csv')
            .limit(100_000)
            ).cache()

23/03/24 13:09:32 WARN CacheManager: Asked to cache already cached data.
23/03/24 13:09:32 WARN CacheManager: Asked to cache already cached data.
23/03/24 13:09:32 WARN CacheManager: Asked to cache already cached data.


In [33]:
# Confirm row limits
print(df_questions.count(), df_answers.count(), df_tags.count())

100000 100000 100000

In [34]:
# Inspect Schemas
df_questions.printSchema()
df_answers.printSchema()
df_tags.printSchema()

root
 |-- Id: string (nullable = true)
 |-- OwnerUserId: string (nullable = true)
 |-- CreationDate: string (nullable = true)
 |-- ClosedDate: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Body: string (nullable = true)

root
 |-- Id: string (nullable = true)
 |-- OwnerUserId: string (nullable = true)
 |-- CreationDate: string (nullable = true)
 |-- ParentId: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Body: string (nullable = true)

root
 |-- Id: string (nullable = true)
 |-- Tag: string (nullable = true)



In [35]:
# Convert datatypes and add column for finding time differences

df_quest_filt = (df_questions
                .withColumn('Id', df_questions['Id'].cast(IntegerType()))
                .withColumn('OwnerUserId', df_questions['OwnerUserId'].cast(IntegerType()))
                .withColumn('CreationDate', F.regexp_replace('CreationDate', 'T', ' '))
                .withColumn('CreationDate', F.regexp_replace('CreationDate', 'Z', ''))
                .withColumn('CreationTime', F.unix_timestamp('CreationDate', 'y-M-d HH:mm:ss').cast(LongType()))
                .withColumn('ClosedDate', F.regexp_replace('ClosedDate', 'T', ' '))
                .withColumn('ClosedDate', F.regexp_replace('ClosedDate', 'Z', ''))
                .withColumn('ClosedTime', F.unix_timestamp('ClosedDate', 'y-M-d HH:mm:ss').cast(LongType()))
                .withColumn('ElapsedTime', (F.col('ClosedTime') - F.col('CreationTime')))
                .withColumn('Score', df_questions['Score'].cast(IntegerType()))
                ).na.drop()

df_answers_filt = (df_answers
                    .withColumn('Id', df_answers['Id'].cast(IntegerType()))
                    .withColumn('OwnerUserId', df_answers['OwnerUserId'].cast(IntegerType()))
                    .withColumn('ParentId', df_answers['ParentId'].cast(IntegerType()))
                    .withColumn('Score', df_answers['Score'].cast(IntegerType()))
                    .withColumn('CreationDate', F.regexp_replace('CreationDate', 'T', ' '))
                    .withColumn('CreationDate', F.regexp_replace('CreationDate', 'Z', ''))
                    .withColumn('CreationTime', F.unix_timestamp('CreationDate', 'y-M-d HH:mm:ss').cast(LongType()))
                    ).na.drop()

df_tags_filt = (df_tags
                .withColumn('Id', df_tags['Id'].cast(IntegerType()))
                ).na.drop()        
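
The conversion above strips the `T` and `Z` markers and then parses the remaining string, so the resulting epoch value depends on the Spark session timezone. As a cross-check, here is a minimal standard-library sketch that treats the trailing `Z` as UTC (its ISO 8601 meaning). The helper name `iso_z_to_epoch` is hypothetical; the two timestamps are the CreationDate and ClosedDate of question 90 from the data shown earlier.

```python
from datetime import datetime, timezone


def iso_z_to_epoch(ts: str) -> int:
    """Parse a timestamp like '2008-08-01T13:57:07Z' as UTC and return epoch seconds."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


created = iso_z_to_epoch("2008-08-01T14:41:24Z")  # question 90, CreationDate
closed = iso_z_to_epoch("2012-12-26T03:45:49Z")   # question 90, ClosedDate
elapsed_seconds = closed - created

print(created, closed, elapsed_seconds)
# 1217601684 1356493549 138891865
```

The `CreationTime` that Spark reports for the same question later in this notebook (1217616084) is 14400 seconds larger, which is consistent with the session parsing the string in a UTC-4 local timezone rather than UTC. If UTC values are wanted, one option is to set the `spark.sql.session.timeZone` configuration to `"UTC"` before parsing; whether that matters depends on how the timestamps are used downstream.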

In [36]:
# Inspect counts after dropping nulls
print(df_quest_filt.count(), df_answers_filt.count(), df_tags_filt.count())

3078 93357 100000


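Around 97% of questions are eliminated after dropping nulls! This is most likely because most questions were never closed, so their "ClosedDate" is null; the pandas summary above shows only 3,331 non-null ClosedDate values in 100,000 rows. It also means the time difference between a question's CreationDate and ClosedDate is available for only a small fraction of questions, so it will not be a useful measure. Let's instead drop rows only when key columns are null.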
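
The effect is easy to reproduce on a toy pandas frame. This is a minimal sketch with made-up rows (only the column names come from the real data): dropping on any null removes every question that was never closed, while restricting the check to key columns keeps them.

```python
import pandas as pd

# Toy rows modeled on the question data; None marks a question that was never closed
toy = pd.DataFrame({
    "Id": [80, 90, 120, 180],
    "ClosedDate": [None, "2012-12-26T03:45:49Z", None, None],
    "Body": ["<p>a</p>", "<p>b</p>", "<p>c</p>", "<p>d</p>"],
})

null_counts = toy.isna().sum()                         # nulls per column
kept_any = len(toy.dropna())                           # drop a row if ANY column is null
kept_subset = len(toy.dropna(subset=["Id", "Body"]))   # drop only when a key column is null

print(null_counts.to_dict(), kept_any, kept_subset)
```

Counting nulls per column before dropping anything makes it clear which column is responsible for the lost rows.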

In [37]:
df_quest_filt = (df_questions
                .withColumn('Id', df_questions['Id'].cast(IntegerType()))
                .withColumn('OwnerUserId', df_questions['OwnerUserId'].cast(IntegerType()))
                .withColumn('CreationDate', F.regexp_replace('CreationDate', 'T', ' '))
                .withColumn('CreationDate', F.regexp_replace('CreationDate', 'Z', ''))
                .withColumn('CreationTime', F.unix_timestamp('CreationDate', 'y-M-d HH:mm:ss').cast(LongType()))
                .withColumn('ClosedDate', F.regexp_replace('ClosedDate', 'T', ' '))
                .withColumn('ClosedDate', F.regexp_replace('ClosedDate', 'Z', ''))
                .withColumn('ClosedTime', F.unix_timestamp('ClosedDate', 'y-M-d HH:mm:ss').cast(LongType()))
                .withColumn('Score', df_questions['Score'].cast(IntegerType()))
                ).na.drop(subset=["Id", "Body", "CreationTime"])

In [38]:
df_quest_filt.count()

97670

In [51]:
df_quest_filt.sort("Id", ascending=True).show(10)

+---+-----------+-------------------+-------------------+-----+--------------------+--------------------+------------+----------+
| Id|OwnerUserId|       CreationDate|         ClosedDate|Score|               Title|                Body|CreationTime|ClosedTime|
+---+-----------+-------------------+-------------------+-----+--------------------+--------------------+------------+----------+
| 80|         26|2008-08-01 13:57:07|                 NA|   26|SQLStatement.exec...|"<p>I've written ...|  1217613427|      null|
| 90|         58|2008-08-01 14:41:24|2012-12-26 03:45:49|  144|Good branching an...|"<p>Are there any...|  1217616084|1356511549|
|120|         83|2008-08-01 15:50:08|                 NA|   21|   ASP.NET Site Maps|<p>Has anyone got...|  1217620208|      null|
|180|    2089740|2008-08-01 18:42:19|                 NA|   53|Function for crea...|<p>This is someth...|  1217630539|      null|
|260|         91|2008-08-01 23:22:08|                 NA|   49|Adding scripting ...|<p>I h

                                                                                

This approach looks much better. Let's start exploring the data.

TODO
Guiding questions brainstorm:
- What are the top ten most common tags?
- Can we predict whether a question contains a top-10 tag from the body text?
- Can we predict score based on text?
- Can we predict how long a question will take to answer?
- Is there a relationship between score and how long a question takes to answer?

### 1. Most Common Tags
What are the most common tags?

In [59]:
tags_count = (df_tags_filt
            .groupBy('Tag')
            .count()
            .sort('count', ascending=False)
            .limit(10)
            )

In [60]:
tags_count.show()

+----------+-----+
|       Tag|count|
+----------+-----+
|        c#| 4657|
|      .net| 2628|
|      java| 2436|
|   asp.net| 2143|
|       php| 1884|
|javascript| 1799|
|       c++| 1686|
|    python| 1329|
|    jquery| 1249|
|       sql| 1234|
+----------+-----+
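
As a sanity check outside Spark, the same group, count, sort, and limit logic can be sketched with `collections.Counter`. The `(Id, Tag)` pairs below are hypothetical rows shaped like `Tags.csv`, not the real data, and the limit is 2 rather than 10 to keep the example small.

```python
from collections import Counter

# Hypothetical (Id, Tag) pairs shaped like rows of Tags.csv
tag_rows = [
    (80, "flex"), (80, "actionscript-3"), (80, "air"),
    (90, "svn"), (90, "tortoisesvn"), (120, "sql"),
    (120, "asp.net"), (180, "algorithm"), (260, "c#"),
    (260, ".net"), (330, "c#"), (470, "sql"), (580, "c#"),
]

# Count occurrences of each tag, then keep the most frequent ones
# (analogous to groupBy('Tag').count() followed by sort and limit)
tag_counts = Counter(tag for _, tag in tag_rows)
top_tags = tag_counts.most_common(2)

print(top_tags)
# [('c#', 3), ('sql', 2)]
```

On the real data, running this over the same 100,000 tag rows should reproduce the Spark counts if both read the same rows.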



Let's save these top 10 as a list.

In [63]:
df = tags_count.select('Tag').toPandas()
most_common_tags = df['Tag'].to_list()
print(most_common_tags)

['c#', '.net', 'java', 'asp.net', 'php', 'javascript', 'c++', 'python', 'jquery', 'sql']
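
One of the guiding questions asks whether a question carries a top-10 tag. A minimal sketch of building that label in plain Python follows, assuming `(Id, Tag)` pairs are available; the pairs for questions 80 and 90 mirror the rows shown earlier, and the rest are hypothetical. The names `tags_by_question` and `has_top_tag` are illustrative only.

```python
from collections import defaultdict

most_common_tags = ['c#', '.net', 'java', 'asp.net', 'php',
                    'javascript', 'c++', 'python', 'jquery', 'sql']
top = set(most_common_tags)

# (Id, Tag) pairs: 80 and 90 mirror rows shown earlier, the rest are hypothetical
tag_rows = [(80, "flex"), (80, "actionscript-3"), (80, "air"),
            (90, "svn"), (90, "tortoisesvn"),
            (120, "sql"), (120, "asp.net"), (260, "c#"), (260, ".net")]

# Collect the set of tags attached to each question Id
tags_by_question = defaultdict(set)
for qid, tag in tag_rows:
    tags_by_question[qid].add(tag)

# A question is labeled True if at least one of its tags is in the top-10 set
has_top_tag = {qid: bool(tags & top) for qid, tags in tags_by_question.items()}

print(has_top_tag)
# {80: False, 90: False, 120: True, 260: True}
```

In Spark, the same label could be derived by filtering the tags DataFrame on the list and joining back to the questions on `Id`; the sketch above only illustrates the logic.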


In [41]:
# spark.stop()