## From CSV to MySQL

In this notebook I use PySpark to:

1 - read the tweets CSV file into a PySpark DataFrame

2 - save the DataFrame to a MySQL table (raw data, before map-reduce)

3 - read the data back from the MySQL table

4 - apply reduce operations and data transformations to the DataFrame

5 - save the resulting DataFrame to MongoDB


The JAR file that lets PySpark connect to MySQL has been downloaded and saved locally, so pyspark was launched with the following argument: `pyspark --jars mysql-connector-j-8.1.0.jar`


In [22]:
#create spark session
import warnings

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName("MyApp") \
  .config("spark.jars", "mysql-connector-j-8.1.0.jar") \
  .master("local") \
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")  #only log errors
warnings.filterwarnings("ignore")  #silence Python warnings in the notebook output

In [23]:
#pyspark read from csv
data = spark.read.csv("/user1/ProjectTweets.csv", inferSchema=True)
data.show()

                                                                                

+---+----------+--------------------+--------+---------------+--------------------+
|_c0|       _c1|                 _c2|     _c3|            _c4|                 _c5|
+---+----------+--------------------+--------+---------------+--------------------+
|  0|1467810369|Mon Apr 06 22:19:...|NO_QUERY|_TheSpecialOne_|@switchfoot http:...|
|  1|1467810672|Mon Apr 06 22:19:...|NO_QUERY|  scotthamilton|is upset that he ...|
|  2|1467810917|Mon Apr 06 22:19:...|NO_QUERY|       mattycus|@Kenichan I dived...|
|  3|1467811184|Mon Apr 06 22:19:...|NO_QUERY|        ElleCTF|my whole body fee...|
|  4|1467811193|Mon Apr 06 22:19:...|NO_QUERY|         Karoli|@nationwideclass ...|
|  5|1467811372|Mon Apr 06 22:20:...|NO_QUERY|       joy_wolf|@Kwesidei not the...|
|  6|1467811592|Mon Apr 06 22:20:...|NO_QUERY|        mybirch|         Need a hug |
|  7|1467811594|Mon Apr 06 22:20:...|NO_QUERY|           coZZ|@LOLTrish hey  lo...|
|  8|1467811795|Mon Apr 06 22:20:...|NO_QUERY|2Hood4Hollywood|@Tatiana_K nop

In [24]:
#pyspark write into MySQL table
#the MySQL database and the table are both called Tweets; the schema was created beforehand through the MySQL CLI
data.write \
  .format("jdbc") \
  .mode("overwrite") \
  .option("url", "jdbc:mysql://localhost:3306/Tweets") \
  .option("dbtable", "Tweets") \
  .option("user", "root") \
  .option("password", "password") \
  .save()
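
The write above and the read in the next cell repeat the same connection settings. A minimal sketch of keeping them in one place is shown below. It is an illustration rather than part of the original notebook: `jdbc_options` is a hypothetical helper, `MYSQL_USER` and `MYSQL_PASSWORD` are assumed environment variable names, and the driver class is the name shipped with Connector/J 8.x.

```python
import os


def jdbc_options(table, database="Tweets", host="localhost", port=3306):
    """Build the option dict shared by the JDBC writer and reader (hypothetical helper)."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "driver": "com.mysql.cj.jdbc.Driver",  # driver class in mysql-connector-j 8.x
        "dbtable": table,
        # Assumed variable names; export them before launching pyspark
        "user": os.environ.get("MYSQL_USER", "root"),
        "password": os.environ.get("MYSQL_PASSWORD", ""),
    }


opts = jdbc_options("Tweets")
# Print everything except the password
print({k: v for k, v in opts.items() if k != "password"})

# With a live session the same dict could serve both directions:
#   data.write.format("jdbc").options(**opts).mode("overwrite").save()
#   df = spark.read.format("jdbc").options(**opts).load()
```

Reading the credentials from the environment keeps the password out of the saved notebook.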

                                                                                

In [25]:
#pyspark read from the MySQL table we just wrote
#com.mysql.cj.jdbc.Driver is the driver class name in mysql-connector-j 8.x
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/Tweets") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "Tweets") \
    .option("user", "root") \
    .option("password", "password") \
    .load()

df.show()


+---+----------+--------------------+--------+---------------+--------------------+
|_c0|       _c1|                 _c2|     _c3|            _c4|                 _c5|
+---+----------+--------------------+--------+---------------+--------------------+
|  0|1467810369|Mon Apr 06 22:19:...|NO_QUERY|_TheSpecialOne_|@switchfoot http:...|
|  1|1467810672|Mon Apr 06 22:19:...|NO_QUERY|  scotthamilton|is upset that he ...|
|  2|1467810917|Mon Apr 06 22:19:...|NO_QUERY|       mattycus|@Kenichan I dived...|
|  3|1467811184|Mon Apr 06 22:19:...|NO_QUERY|        ElleCTF|my whole body fee...|
|  4|1467811193|Mon Apr 06 22:19:...|NO_QUERY|         Karoli|@nationwideclass ...|
|  5|1467811372|Mon Apr 06 22:20:...|NO_QUERY|       joy_wolf|@Kwesidei not the...|
|  6|1467811592|Mon Apr 06 22:20:...|NO_QUERY|        mybirch|         Need a hug |
|  7|1467811594|Mon Apr 06 22:20:...|NO_QUERY|           coZZ|@LOLTrish hey  lo...|
|  8|1467811795|Mon Apr 06 22:20:...|NO_QUERY|2Hood4Hollywood|@Tatiana_K nop

                                                                                

#### Data engineering using pyspark

In [26]:
row_count = data.count()  #count once; every count() call runs a full Spark job
if row_count > 1000000:
    print(f"{row_count} ... That's a lot of Data!!")



1600000 ... That's a lot of Data!!


                                                                                

In [28]:
#check unique values for _c3
data.select('_c3').distinct().collect()
#field only has 1 value, dropping field

                                                                                

[Row(_c3='NO_QUERY')]

In [31]:
#drop by column name; df._c3 referred to the DataFrame read from MySQL, not to data
df = data.drop("_c3")

In [32]:
#checking for duplicates
df.groupby("_c1").count().where("count > 1").show()



+----------+-----+
|       _c1|count|
+----------+-----+
|1468544973|    2|
|1690908358|    2|
|1834777946|    2|
|1882160717|    2|
|1965601765|    2|
|1982434182|    2|
|2002309001|    2|
|2190980212|    2|
|1685304801|    2|
|1686371908|    2|
|1957194329|    2|
|1969964899|    2|
|1974268607|    2|
|2056807406|    2|
|2063670799|    2|
|1556266702|    2|
|1752414405|    2|
|1824843992|    2|
|1881996107|    2|
|1983726537|    2|
+----------+-----+
only showing top 20 rows



                                                                                

In [33]:
duplicates = df.groupby("_c1").count().where("count > 1").drop("count")
print(f"Number of duplicates: {duplicates.count()}")



Number of duplicates: 1685


                                                                                

In [36]:
#show 1 duplicate example
df[df["_c1"] == 1983726537].show(truncate=False)



+------+----------+----------------------------+-------+---------------------------------------------------------------------------------------------+
|_c0   |_c1       |_c2                         |_c4    |_c5                                                                                          |
+------+----------+----------------------------+-------+---------------------------------------------------------------------------------------------+
|252393|1983726537|Sun May 31 13:42:57 PDT 2009|iargent|Should have gone on a bike ride today but never quite happened  Still enjoyed the sun though |
+------+----------+----------------------------+-------+---------------------------------------------------------------------------------------------+





In [38]:
df = df.dropDuplicates(['_c1'])
df.count() #checking how many values after dropping duplicates

                                                                                

1598315
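
As a sanity check on the numbers above, 1600000 - 1598315 = 1685, which matches the number of duplicated IDs and is consistent with each duplicated ID appearing exactly twice. The plain-Python sketch below mirrors the two steps on toy data. The rows and the helper names are made up for illustration, and unlike this sketch, Spark's `dropDuplicates` does not promise which of the duplicate rows it keeps.

```python
from collections import Counter


def duplicate_ids(ids):
    """IDs that occur more than once, like groupby("_c1").count().where("count > 1")."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


def drop_duplicate_ids(rows, key):
    """Keep the first row seen for each key value (Spark keeps an arbitrary one)."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


# Toy rows, not taken from the dataset
rows = [
    {"_c1": 10, "_c5": "first"},
    {"_c1": 11, "_c5": "second"},
    {"_c1": 10, "_c5": "first again"},
]

dups = duplicate_ids(r["_c1"] for r in rows)
deduped = drop_duplicate_ids(rows, "_c1")
print(dups, len(deduped))

# Same bookkeeping as in the notebook: rows removed == duplicated IDs
assert len(rows) - len(deduped) == len(dups)
assert 1_600_000 - 1_598_315 == 1_685
```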

In [39]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: long (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)



#### Dealing with timestamps

In [40]:
#example:
df.first()["_c2"] #PDT stands for Pacific Daylight Time (UTC-7)

                                                                                

'Mon Apr 06 22:32:38 PDT 2009'

In [41]:
#let's check if all timestamps are in PDT
#if every string contains PDT, this filter returns an empty list
#the filter runs inside Spark, so the rows are not all pulled to the driver
df.filter(~df["_c2"].contains("PDT")).collect()


[]

In [42]:
#all timestamps are PDT
from pyspark.sql.functions import to_timestamp
#set to LEGACY because the default parser policy raised an error on this format
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
#spark.conf.set("spark.sql.legacy.timeParserPolicy","CORRECTED") #to switch back to the non-legacy parser

Time_Format = "E MMM d HH:mm:ss z yyyy"
#refer to columns by name so they resolve against df rather than data
df = df.withColumn("Timestamp", to_timestamp("_c2", Time_Format))
df = df.drop("_c2", "_c0")
df.show()



+----------+--------------+--------------------+-------------------+
|       _c1|           _c4|                 _c5|          Timestamp|
+----------+--------------+--------------------+-------------------+
|1467860144|      Jana1976|@JonathanRKnight ...|2009-04-07 06:32:38|
|1467862225|         hdm42|@vjl also, your w...|2009-04-07 06:33:11|
|1467889791| jennhelvering|Just called Hills...|2009-04-07 06:40:33|
|1467898027|    twitrbug81|@JonathanRKnight ...|2009-04-07 06:42:49|
|1467904302| bsbnumber1fan|@nick_carter Aww ...|2009-04-07 06:44:34|
|1467928749|      calliott|is tireddddddd. w...|2009-04-07 06:51:26|
|1467946810|TheDarrenxshow|@ilovepie mines t...|2009-04-07 06:56:37|
|1467968979|     atothebed|@clarianne APRIL ...|2009-04-07 07:02:45|
|1467987384|   vardenrhode|Just published a ...|2009-04-07 07:08:02|
|1468005581|     Yahtzee27|@littrellfans Its...|2009-04-07 07:13:16|
|1468010346|  Kelsey_Leigh|Why does school t...|2009-04-07 07:14:43|
|1468038360|   serendipify|@deon -
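
Spark's `show()` renders timestamps in the session time zone, which is why the `22:32:38 PDT` string seen earlier does not appear as `22:32:38` in the table. The plain-Python sketch below makes the conversion explicit. It is not Spark's parser: `parse_pdt` is a hypothetical helper, and the fixed UTC-7 offset relies on the check above that every row is in PDT.

```python
from datetime import datetime, timedelta, timezone

# Pacific Daylight Time is UTC-7
PDT = timezone(timedelta(hours=-7), "PDT")


def parse_pdt(raw):
    """Parse a string like 'Mon Apr 06 22:32:38 PDT 2009' into an aware datetime."""
    parts = raw.split()
    if len(parts) != 6 or parts[4] != "PDT":
        raise ValueError(f"unexpected timestamp format: {raw!r}")
    # Drop the zone token before parsing; strptime's %Z support for "PDT"
    # depends on the platform, so the offset is attached explicitly instead
    naive = datetime.strptime(" ".join(parts[:4] + parts[5:]), "%a %b %d %H:%M:%S %Y")
    return naive.replace(tzinfo=PDT)


ts = parse_pdt("Mon Apr 06 22:32:38 PDT 2009")
print(ts.astimezone(timezone.utc))  # 2009-04-07 05:32:38+00:00
```

The UTC value is 05:32:38, while the table shows 06:32:38 for the same row. That is consistent with a session running one hour ahead of UTC, although this is an inference: the notebook does not state its time zone.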

                                                                                

In [44]:
#remove hashtags
from pyspark.sql import functions as f
#raw string so that \s reaches the regex engine without a Python escape warning
df = df.withColumn("Text",f.regexp_replace("_c5",r"#([^\s]+)\s",""))

In [46]:
#remove mentions
df = df.withColumn("Text",f.regexp_replace("Text",r"@([^\s]+)\s",""))
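
Both patterns require a whitespace character after the tag, so a hashtag or mention at the very end of a tweet is left in place. The sketch below reproduces the two replacements with Python's `re` module to show this. It only approximates the Spark calls, since `regexp_replace` uses Java regular expressions, which should behave the same for patterns this simple. The sample strings and the `strip_tags_any_position` variant are assumptions added for illustration.

```python
import re

# Same patterns as in the notebook cells
HASHTAG = re.compile(r"#([^\s]+)\s")
MENTION = re.compile(r"@([^\s]+)\s")


def strip_tags(text):
    """Remove hashtags, then mentions, each only when followed by whitespace."""
    return MENTION.sub("", HASHTAG.sub("", text))


def strip_tags_any_position(text):
    """Hypothetical variant: the trailing whitespace is optional."""
    return re.sub(r"[#@]\S+\s?", "", text)


print(strip_tags("@switchfoot hello #fun world"))     # hello world
print(strip_tags("great day #sunny"))                 # great day #sunny
print(repr(strip_tags_any_position("great day #sunny")))  # 'great day '
```

The variant also removes a tag at the end of the text, but it is broader than the original: it strips any run of non-space characters that starts with `#` or `@`, including the part of an email address after the `@`.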

In [47]:
df.show()



+----------+--------------+--------------------+-------------------+--------------------+
|       _c1|           _c4|                 _c5|          Timestamp|                Text|
+----------+--------------+--------------------+-------------------+--------------------+
|1467860144|      Jana1976|@JonathanRKnight ...|2009-04-07 06:32:38|I hate the limite...|
|1467862225|         hdm42|@vjl also, your w...|2009-04-07 06:33:11|also, your websit...|
|1467889791| jennhelvering|Just called Hills...|2009-04-07 06:40:33|Just called Hills...|
|1467898027|    twitrbug81|@JonathanRKnight ...|2009-04-07 06:42:49|Thought you were ...|
|1467904302| bsbnumber1fan|@nick_carter Aww ...|2009-04-07 06:44:34|Aww Nick!! I like...|
|1467928749|      calliott|is tireddddddd. w...|2009-04-07 06:51:26|is tireddddddd. w...|
|1467946810|TheDarrenxshow|@ilovepie mines t...|2009-04-07 06:56:37|mines too... I'm ...|
|1467968979|     atothebed|@clarianne APRIL ...|2009-04-07 07:02:45|APRIL 9TH ISN'T C...|
|146798738

                                                                                

In [52]:
#drop the raw text and user name columns by name
df = df.drop("_c5", "_c4")

In [53]:
df.show()



+----------+-------------------+--------------------+
|       _c1|          Timestamp|                Text|
+----------+-------------------+--------------------+
|1467860144|2009-04-07 06:32:38|I hate the limite...|
|1467862225|2009-04-07 06:33:11|also, your websit...|
|1467889791|2009-04-07 06:40:33|Just called Hills...|
|1467898027|2009-04-07 06:42:49|Thought you were ...|
|1467904302|2009-04-07 06:44:34|Aww Nick!! I like...|
|1467928749|2009-04-07 06:51:26|is tireddddddd. w...|
|1467946810|2009-04-07 06:56:37|mines too... I'm ...|
|1467968979|2009-04-07 07:02:45|APRIL 9TH ISN'T C...|
|1467987384|2009-04-07 07:08:02|Just published a ...|
|1468005581|2009-04-07 07:13:16|Its all good. Jus...|
|1468010346|2009-04-07 07:14:43|Why does school t...|
|1468038360|2009-04-07 07:23:22|- &quot;source sh...|
|1468070706|2009-04-07 07:33:20|Not this many files |
|1468071555|2009-04-07 07:33:35|OMG, you particip...|
|1468071701|2009-04-07 07:33:38|  &quot;Now, if w...|
|1468088102|2009-04-07 07:38

                                                                                