# The Daily Show

In this use case we take a dataset from a well-known TV show, The Daily Show, and analyze the guests who appeared on it.

**Dataset Description:**

- **YEAR** – The year the episode aired.

- **GoogleKnowlege_Occupation** – The guest's occupation or office, according to Google's Knowledge Graph or, if they are not listed there, how Stewart introduced them on the program.

- **Show** – Air date of the episode. Not unique, as some shows had more than one guest.

- **Group** – A larger group designation for the occupation. For instance, US senators, US presidents, and former presidents are all under "politicians".

- **Raw_Guest_List** – The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation refers to only one of them in a given row.
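
To make the column layout concrete, here is a minimal plain-Python sketch (no Spark required) that parses one row into named fields. The `GuestRow` class and `parse_row` helper are hypothetical names introduced for illustration, and the sample line is reconstructed from the first row displayed later in this notebook, assuming the file is comma-separated with no header row.

```python
import csv
from typing import NamedTuple


class GuestRow(NamedTuple):
    """Hypothetical container mirroring the five dataset columns."""
    year: int
    occupation: str      # GoogleKnowlege_Occupation
    show: str            # air date as text, e.g. "1/11/99"
    group: str
    raw_guest_list: str


def parse_row(line: str) -> GuestRow:
    """Parse one comma-separated line into a GuestRow."""
    # csv.reader handles quoted fields that contain commas
    year, occupation, show, group, guests = next(csv.reader([line]))
    return GuestRow(int(year), occupation, show, group, guests)


# Sample line assumed from the first row shown in the notebook output
sample = "1999,actor,1/11/99,Acting,Michael J. Fox"
row = parse_row(sample)
print(row.show, "|", row.raw_guest_list)
```

Note that `show` stays a plain string at this point, just as it does when Spark infers the schema below.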

## Initializing the Spark session

In [1]:
import findspark
findspark.init()
import pyspark

In [2]:
from pyspark.sql import SparkSession

In [6]:
spark = SparkSession.builder.appName('usecase_10').getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')

### Daily Show Data

In [7]:
df = spark.read.format('csv').options(header=False, inferSchema=True).load('daily_show_guests')

In [8]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)



In [9]:
df.show(3)

+----+------------------+-------+------+---------------+
| _c0|               _c1|    _c2|   _c3|            _c4|
+----+------------------+-------+------+---------------+
|1999|             actor|1/11/99|Acting| Michael J. Fox|
|1999|          Comedian|1/12/99|Comedy|Sandra Bernhard|
|1999|television actress|1/13/99|Acting|  Tracey Ullman|
+----+------------------+-------+------+---------------+
only showing top 3 rows



In [10]:
# Note: 'YEAR ' keeps a trailing space, so later references to this column must include it
columns = ['YEAR ','GoogleKnowlege_Occupation','Show','Group','Raw_Guest_List']

In [11]:
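# Rename the auto-generated columns (_c0 ... _c4) to the descriptive names, in order
for old_name, new_name in zip(df.columns, columns):
    df = df.withColumnRenamed(old_name, new_name)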

In [12]:
df.printSchema()

root
 |-- YEAR : integer (nullable = true)
 |-- GoogleKnowlege_Occupation: string (nullable = true)
 |-- Show: string (nullable = true)
 |-- Group: string (nullable = true)
 |-- Raw_Guest_List: string (nullable = true)



In [13]:
df.show(5)

+-----+-------------------------+-------+------+----------------+
|YEAR |GoogleKnowlege_Occupation|   Show| Group|  Raw_Guest_List|
+-----+-------------------------+-------+------+----------------+
| 1999|                    actor|1/11/99|Acting|  Michael J. Fox|
| 1999|                 Comedian|1/12/99|Comedy| Sandra Bernhard|
| 1999|       television actress|1/13/99|Acting|   Tracey Ullman|
| 1999|             film actress|1/14/99|Acting|Gillian Anderson|
| 1999|                    actor|1/18/99|Acting|David Alan Grier|
+-----+-------------------------+-------+------+----------------+
only showing top 5 rows



In [15]:
df.count()

2693

In [23]:
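# 'Show' is still a string column here, so BETWEEN compares text
# lexicographically rather than chronologically; the count below is not a true date-range count
df.where("Show BETWEEN '1/11/99' AND '6/11/99'").count()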

1786
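
The count above is far larger than a five-month window would suggest, because comparing date-like strings orders them character by character. The following plain-Python sketch (no Spark required) shows the effect; the `candidate` value is a made-up air date used only for illustration, and `parse_show` is a hypothetical helper.

```python
from datetime import datetime


def parse_show(text: str) -> datetime:
    """Parse a month/day/two-digit-year string such as '1/11/99'."""
    return datetime.strptime(text, "%m/%d/%y")


start, end = "1/11/99", "6/11/99"
candidate = "12/1/05"  # hypothetical air date, well outside the 1999 window

# Text comparison: '1/...' < '12/...' because '/' sorts before '2'
in_range_as_text = start <= candidate <= end

# Date comparison: 2005-12-01 is not between 1999-01-11 and 1999-06-11
in_range_as_date = parse_show(start) <= parse_show(candidate) <= parse_show(end)

print(in_range_as_text, in_range_as_date)  # True False
```

This is why the analysis below first converts `Show` into a properly ordered date value before filtering.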

In [24]:
from pyspark.sql.functions import desc,unix_timestamp, from_unixtime

### Problem Statement: Find the top 5 GoogleKnowlege_Occupation values among the guests who appeared on the show in a particular time period.

In [29]:
# Parse the 'Show' string (month/day/two-digit year) and append it as 'Show_date'.
# from_unixtime returns a 'yyyy-MM-dd HH:mm:ss' string, which sorts chronologically.
df2 = df.withColumn('Show_date', from_unixtime(unix_timestamp('Show', 'MM/dd/yy')))

In [30]:
df2.show(5)

+-----+-------------------------+-------+------+----------------+-------------------+
|YEAR |GoogleKnowlege_Occupation|   Show| Group|  Raw_Guest_List|          Show_date|
+-----+-------------------------+-------+------+----------------+-------------------+
| 1999|                    actor|1/11/99|Acting|  Michael J. Fox|1999-01-11 00:00:00|
| 1999|                 Comedian|1/12/99|Comedy| Sandra Bernhard|1999-01-12 00:00:00|
| 1999|       television actress|1/13/99|Acting|   Tracey Ullman|1999-01-13 00:00:00|
| 1999|             film actress|1/14/99|Acting|Gillian Anderson|1999-01-14 00:00:00|
| 1999|                    actor|1/18/99|Acting|David Alan Grier|1999-01-18 00:00:00|
+-----+-------------------------+-------+------+----------------+-------------------+
only showing top 5 rows



In [38]:
df2.where("Show_date BETWEEN '1999-01-11 00:00:00' AND '1999-06-11 00:00:00'").count()

75

In [43]:
df2.where("Show_date BETWEEN '1999-01-11 00:00:00' AND '1999-06-11 00:00:00'") \
    .groupBy('GoogleKnowlege_Occupation') \
    .count() \
    .orderBy(desc('count')) \
    .show(5)

+-------------------------+-----+
|GoogleKnowlege_Occupation|count|
+-------------------------+-----+
|                    actor|   29|
|                  actress|   20|
|                 comedian|    4|
|       television actress|    3|
|               film actor|    2|
+-------------------------+-----+
only showing top 5 rows
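
One thing to watch in this result: the first rows of the data show `Comedian` capitalized, while the table above lists `comedian`, which suggests that differently cased spellings are counted as separate occupations. The plain-Python sketch below (no Spark required) illustrates how normalizing case before counting merges them. The `occupations` list is a small illustrative sample built from values that appear in this notebook, not the real dataset.

```python
from collections import Counter

# Illustrative sample only; not the actual dataset
occupations = [
    "actor",
    "Comedian",
    "television actress",
    "film actress",
    "actor",
    "comedian",
]

# Case-sensitive counting keeps 'Comedian' and 'comedian' apart
raw_counts = Counter(occupations)

# Normalizing whitespace and case merges the two spellings
normalized_counts = Counter(o.strip().lower() for o in occupations)

print(raw_counts["Comedian"], raw_counts["comedian"])  # 1 1
print(normalized_counts["comedian"])                   # 2
```

In PySpark the same idea could be applied by grouping on `lower(col('GoogleKnowlege_Occupation'))` from `pyspark.sql.functions`; whether that changes the top 5 here depends on the data and has not been checked.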



## Closing Spark Session

In [44]:
spark.stop()