# Sparkify Project Workspace
This workspace contains a tiny subset (128MB) of the full dataset available (12GB). Feel free to use this workspace to build your project, or to explore a smaller subset with Spark before deploying your cluster on the cloud. Instructions for setting up your Spark cluster are included in the last lesson of the Extracurricular Spark Course content.

You can follow the steps below to guide your data analysis and model building portion of this project.

In [379]:
from datetime import datetime

from pyspark.sql import SparkSession

from pyspark.sql.functions import isnan, when, first, avg, last, count, countDistinct, col, min as smin, max as smax, sum as ssum, lag, lead, coalesce, lit

from pyspark.sql.window import Window

In [2]:
!ls

2m_rows_sparkify_event_data.csv  README.MD
b_1_sparkify_event_data.csv	 requirements.txt
ETL.ipynb			 sparkify_event_data.json
full_sparkify_event_data.json	 Sparkify.ipynb
medium_sparkify_event_data.csv


In [140]:
filepath = 'medium_sparkify_event_data.json'

In [3]:
spark = SparkSession \
    .builder \
    .appName("Sparkify") \
    .getOrCreate()

spark.sparkContext.setLogLevel('DEBUG')

# Load and Clean Dataset
In this workspace, the mini-dataset file is `mini_sparkify_event_data.json`. Load and clean the dataset, checking for invalid or missing data - for example, records without userids or sessionids. 

In [233]:
# df = spark.read.option("inferSchema", "true").option("header", "true").option("encoding", "utf-8").csv(filepath)
# "inferSchema" and "header" are CSV options; the JSON reader infers the schema on its own
df = spark.read.option("encoding", "utf-8").json(filepath)

In [142]:
df.dtypes

[('artist', 'string'),
 ('auth', 'string'),
 ('firstName', 'string'),
 ('gender', 'string'),
 ('itemInSession', 'bigint'),
 ('lastName', 'string'),
 ('length', 'double'),
 ('level', 'string'),
 ('location', 'string'),
 ('method', 'string'),
 ('page', 'string'),
 ('registration', 'bigint'),
 ('sessionId', 'bigint'),
 ('song', 'string'),
 ('status', 'bigint'),
 ('ts', 'bigint'),
 ('userAgent', 'string'),
 ('userId', 'string')]

In [143]:
df.show(n=2, truncate=False, vertical=True)

-RECORD 0-----------------------------------------------------------------------------------------------------------------------------------
 artist        | Martin Orford                                                                                                              
 auth          | Logged In                                                                                                                  
 firstName     | Joseph                                                                                                                     
 gender        | M                                                                                                                          
 itemInSession | 20                                                                                                                         
 lastName      | Morales                                                                                                                    
 length      

In [144]:
df.columns

['artist',
 'auth',
 'firstName',
 'gender',
 'itemInSession',
 'lastName',
 'length',
 'level',
 'location',
 'method',
 'page',
 'registration',
 'sessionId',
 'song',
 'status',
 'ts',
 'userAgent',
 'userId']

In [19]:
not_na_columns = [ 'userId', 'sessionId' ]

In [145]:
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()

+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+----+------+---+---------+------+
|artist|auth|firstName|gender|itemInSession|lastName|length|level|location|method|page|registration|sessionId|song|status| ts|userAgent|userId|
+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+----+------+---+---------+------+
|     0|   0|        0|     0|            0|       0|     0|    0|       0|     0|   0|           0|        0|   0|     0|  0|        0|     0|
+------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+----+------+---+---------+------+
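Note that `isnan` only flags floating-point NaN values; it does not flag nulls or empty strings, so an all-zero table does not by itself show that the data is complete. As a rough, Spark-free sketch of a broader check, one could count `None`, NaN, and empty strings per column. The helpers `is_missing` and `count_missing` and the sample records are hypothetical, made up for illustration only.

```python
import math

def is_missing(value):
    """Treat None, float NaN, and empty or whitespace-only strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

def count_missing(records, columns):
    """Return {column: number of records whose value is missing}."""
    return {c: sum(is_missing(r.get(c)) for r in records) for c in columns}

# Hypothetical sample rows, not taken from the dataset.
sample = [
    {"userId": "92", "sessionId": 4790, "length": 231.5},
    {"userId": "", "sessionId": 12, "length": None},
    {"userId": "140", "sessionId": None, "length": float("nan")},
]

missing = count_missing(sample, ["userId", "sessionId", "length"])
print(missing)
```

In Spark, the analogous per-column conditions would be `col(c).isNull()` and `col(c) == ''`, combined with `isnan` for numeric columns.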



In [146]:
df.groupBy('userId').count().orderBy('count', ascending = False).show(50)

+------+-----+
|userId|count|
+------+-----+
|      |15700|
|    92| 9767|
|   140| 7448|
|300049| 7309|
|   101| 6842|
|300035| 6810|
|   195| 6184|
|   230| 6019|
|   163| 5965|
|   250| 5678|
|    18| 5511|
|   276| 5346|
|300017| 5266|
|    87| 5243|
|   293| 5125|
|300021| 5076|
|    42| 4952|
|300011| 4816|
|    30| 4737|
|    12| 4232|
|300031| 4194|
|   126| 4190|
|   283| 4181|
|   228| 4092|
|   100| 3999|
|   259| 3633|
|   105| 3597|
|   246| 3566|
|   121| 3541|
|   269| 3511|
|   292| 3504|
|    70| 3465|
|    35| 3456|
|    38| 3211|
|    98| 3206|
|   282| 3191|
|   185| 3088|
|300023| 3018|
|100009| 2987|
|   157| 2966|
|200023| 2955|
|   104| 2950|
|   174| 2917|
|   225| 2849|
|300038| 2829|
|   172| 2728|
|    85| 2696|
|   258| 2684|
|200020| 2654|
|   179| 2639|
+------+-----+
only showing top 50 rows



In [147]:
df.select("artist").distinct().count()

21248

In [148]:
df.select('length').describe().show()

+-------+------------------+
|summary|            length|
+-------+------------------+
|  count|            432877|
|   mean|248.66459278007508|
| stddev| 98.41266955052019|
|    min|           0.78322|
|    max|        3024.66567|
+-------+------------------+



In [149]:
print(f'Rows before: {df.count()}')

df = df.where(df.userId != '')

print(f'Rows after: {df.count()}')

Rows before: 543705
Rows after: 528005


# Exploratory Data Analysis
When you're working with the full dataset, perform EDA by loading a small subset of the data and doing basic manipulations within Spark. In this workspace, you are already provided a small subset of data you can explore.

### Define Churn

Once you've done some preliminary analysis, create a column `Churn` to use as the label for your model. I suggest using the `Cancellation Confirmation` events to define your churn, which happen for both paid and free users. As a bonus task, you can also look into the `Downgrade` events.
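The labeling rule described above can be sketched in plain Python, independent of Spark: a user is labeled 1 if any of their events is a `Cancellation Confirmation`, and 0 otherwise. The `churn_labels` helper and the sample events are assumptions made for this illustration.

```python
CHURN_PAGE = "Cancellation Confirmation"

def churn_labels(events):
    """Map each userId to 1 if the user ever reached CHURN_PAGE, else 0."""
    labels = {}
    for event in events:
        user = event["userId"]
        hit = 1 if event["page"] == CHURN_PAGE else 0
        # Once a user is flagged as churned, the flag stays set.
        labels[user] = max(labels.get(user, 0), hit)
    return labels

# Hypothetical events, not taken from the dataset.
events = [
    {"userId": "A", "page": "NextSong"},
    {"userId": "A", "page": "Cancel"},
    {"userId": "A", "page": "Cancellation Confirmation"},
    {"userId": "B", "page": "NextSong"},
    {"userId": "B", "page": "Downgrade"},
]

labels = churn_labels(events)
print(labels)
```

The label is defined per user rather than per event, so every row belonging to a churned user would carry the same value once it is joined back onto the event data.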

### Explore Data
Once you've defined churn, perform some exploratory data analysis to observe the behavior for users who stayed vs users who churned. You can start by exploring aggregates on these two groups of users, observing how much of a specific action they experienced per a certain time unit or number of songs played.
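One way to compare the two groups is to normalize an action by the number of songs played, so that heavy and light listeners become comparable. The sketch below is plain Python under assumed inputs: the `action_rate_per_100_songs` helper, the `labels` mapping (1 = churned, 0 = stayed), and the sample events are all hypothetical.

```python
from collections import defaultdict

def action_rate_per_100_songs(events, labels, action):
    """Per churn group, count `action` events per 100 NextSong events."""
    actions = defaultdict(int)
    songs = defaultdict(int)
    for event in events:
        group = labels[event["userId"]]
        if event["page"] == action:
            actions[group] += 1
        if event["page"] == "NextSong":
            songs[group] += 1
    # Groups without any song plays are skipped to avoid dividing by zero.
    return {g: 100.0 * actions[g] / songs[g] for g in songs if songs[g] > 0}

# Hypothetical data: user A churned, user B stayed.
labels = {"A": 1, "B": 0}
events = (
    [{"userId": "A", "page": "NextSong"}] * 4
    + [{"userId": "A", "page": "Roll Advert"}] * 2
    + [{"userId": "B", "page": "NextSong"}] * 10
    + [{"userId": "B", "page": "Roll Advert"}]
)

rates = action_rate_per_100_songs(events, labels, "Roll Advert")
print(rates)
```

The same ratio could be computed per time unit instead by dividing by the observed hours of activity rather than by the song count.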

In [150]:
df.groupBy('page').count().orderBy('count', ascending = False).show(50)

+--------------------+------+
|                page| count|
+--------------------+------+
|            NextSong|432877|
|           Thumbs Up| 23826|
|                Home| 19089|
|     Add to Playlist| 12349|
|          Add Friend|  8087|
|         Roll Advert|  7773|
|              Logout|  5990|
|         Thumbs Down|  4911|
|           Downgrade|  3811|
|            Settings|  2964|
|                Help|  2644|
|               About|  1026|
|             Upgrade|   968|
|       Save Settings|   585|
|               Error|   503|
|      Submit Upgrade|   287|
|    Submit Downgrade|   117|
|              Cancel|    99|
|Cancellation Conf...|    99|
+--------------------+------+



Some questions about the data:

- Are errors related to downgrading or canceling the service?
- Can having a certain number of friends, or a sense of community, decrease churn?
- Are thumbs down related to churn? (Could the quality of the song catalog affect churn?)
- Is the advertising annoying users?
- Are users who stay connected for longer less likely to churn?
- Is the home page relevant?
- How much more likely to churn are users who visit the downgrade page?

In [151]:
df.groupBy('status').count().orderBy('count', ascending = False).show(20)

+------+------+
|status| count|
+------+------+
|   200|483600|
|   307| 43902|
|   404|   503|
+------+------+



In [152]:
df.filter('userId = 92').groupBy('page').count().orderBy('count', ascending = False).show(50)

+----------------+-----+
|            page|count|
+----------------+-----+
|        NextSong| 8177|
|       Thumbs Up|  400|
|            Home|  308|
| Add to Playlist|  248|
|      Add Friend|  158|
|          Logout|   96|
|       Downgrade|   85|
|     Thumbs Down|   80|
|            Help|   62|
|     Roll Advert|   60|
|        Settings|   48|
|           About|   17|
|           Error|   13|
|         Upgrade|    7|
|   Save Settings|    5|
|  Submit Upgrade|    2|
|Submit Downgrade|    1|
+----------------+-----+



In [155]:
df.filter('userId = 92').groupBy('userAgent').count().orderBy('count', ascending = False).show(50)

+--------------------+-----+
|           userAgent|count|
+--------------------+-----+
|"Mozilla/5.0 (iPa...| 9767|
+--------------------+-----+



In [156]:
df.filter('userId = 92 and song is not null').groupBy('song').count().orderBy('count', ascending = False).show(50)

+--------------------+-----+
|                song|count|
+--------------------+-----+
|      You're The One|   49|
|                Undo|   34|
|             Revelry|   28|
|Horn Concerto No....|   27|
|    Ain't Misbehavin|   23|
|             Secrets|   20|
|             Invalid|   20|
|Dog Days Are Over...|   19|
|            Tive Sim|   18|
|            Marry Me|   16|
|              Canada|   16|
|        Use Somebody|   16|
|          Représente|   15|
|    Bring Me To Life|   14|
|  Sayonara-Nostalgia|   14|
|       Sehr kosmisch|   14|
|         Bulletproof|   13|
|Catch You Baby (S...|   13|
| I CAN'T GET STARTED|   12|
|                Home|   11|
|              Yellow|   10|
|Don't Stop The Music|   10|
|    Hey_ Soul Sister|   10|
|         The Maestro|   10|
|          Kryptonite|   10|
|Make Love To Your...|   10|
|           Fireflies|   10|
|   Sincerité Et J...|    9|
|    Times Like These|    9|
|      Drop The World|    9|
|        Day 'N' Nite|    9|
|         Bubb

In [257]:
w_session = Window.partitionBy('sessionId').orderBy('ts')

df.filter('page = "Cancellation Confirmation"').select('ts', 'sessionId', 'itemInSession', 'userId', 'last_page_on_session', 'page').show() 

+-------------+---------+-------------+------+--------------------+--------------------+
|           ts|sessionId|itemInSession|userId|last_page_on_session|                page|
+-------------+---------+-------------+------+--------------------+--------------------+
|1538987586000|     1010|            7|    54|              Cancel|Cancellation Conf...|
|1543297283000|     4473|            5|   167|              Cancel|Cancellation Conf...|
|1540103857000|      237|          239|100014|              Cancel|Cancellation Conf...|
|1542055421000|     3368|           59|   162|              Cancel|Cancellation Conf...|
|1538678296000|      130|           55|100044|              Cancel|Cancellation Conf...|
|1541147487000|     2575|           41|    16|              Cancel|Cancellation Conf...|
|1540809016000|     2643|           83|   172|              Cancel|Cancellation Conf...|
|1542443637000|     3791|           47|   120|              Cancel|Cancellation Conf...|
|1543440994000|     4

In [157]:
df.select(['ts', 'page', 'sessionId', 'itemInSession', 'song', 'artist']).filter('userId = 92').orderBy('ts', ascending = False).show(100)

+-------------+---------------+---------+-------------+--------------------+--------------------+
|           ts|           page|sessionId|itemInSession|                song|              artist|
+-------------+---------------+---------+-------------+--------------------+--------------------+
|1543615837000|     Add Friend|     4790|           86|                null|                null|
|1543615836000|           Home|     4790|           85|                null|                null|
|1543615832000|       NextSong|     4790|           84|           Fireflies|  Charttraxx Karaoke|
|1543615597000|       NextSong|     4790|           83|    Music Of The Sun|             Rihanna|
|1543615053000|       NextSong|     4790|           82|Kun Puut Tekee Se...|Scandinavian Musi...|
|1543614872000|       NextSong|     4790|           81|             The Sun|   Portugal. The Man|
|1543614674000|       NextSong|     4790|           80|             Banquet|          Bloc Party|
|1543614504000|     

In [73]:
df_ts = df.filter('userId = 1333174').select([smin('ts').alias('min_ts'), smax('ts').alias('max_ts')]).rdd.flatMap(lambda x: x).collect()

In [82]:
def to_datetime(milliseconds, dt_format = '%Y-%m-%d %H:%M:%S'):
    return datetime.fromtimestamp(milliseconds / 1000).strftime(dt_format)  

list(map(to_datetime, df_ts))

['2018-09-30 23:46:29', '2018-10-07 03:53:38']
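`datetime.fromtimestamp` without a `tz` argument converts to the local time zone of the machine running the notebook, so the strings above depend on where the code runs. A minimal sketch of a time-zone-independent variant follows; the name `to_datetime_utc` is an assumption and not part of the original notebook.

```python
from datetime import datetime, timezone

def to_datetime_utc(milliseconds, dt_format="%Y-%m-%d %H:%M:%S"):
    """Format an epoch timestamp given in milliseconds as a UTC string."""
    moment = datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)
    return moment.strftime(dt_format)

print(to_datetime_utc(1538352011000))  # 2018-10-01 00:00:11
```

Using UTC consistently makes the printed dates reproducible across a local workspace and a cloud cluster.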

In [128]:
list(map(to_datetime, df.select([smin('ts').alias('min_ts'), smax('ts').alias('max_ts')]).rdd.flatMap(lambda x: x).collect()))

['2018-09-30 21:00:01', '2018-10-07 06:44:26']

In [85]:
df.groupBy('location').count().orderBy('count', ascending = False).show(50, False)

+----------------------------------------------+------+
|location                                      |count |
+----------------------------------------------+------+
|New York-Newark-Jersey City, NY-NJ-PA         |125507|
|Los Angeles-Long Beach-Anaheim, CA            |91139 |
|null                                          |64332 |
|Chicago-Naperville-Elgin, IL-IN-WI            |59171 |
|Philadelphia-Camden-Wilmington, PA-NJ-DE-MD   |45041 |
|Washington-Arlington-Alexandria, DC-VA-MD-WV  |43905 |
|Dallas-Fort Worth-Arlington, TX               |42425 |
|Miami-Fort Lauderdale-West Palm Beach, FL     |41035 |
|Atlanta-Sandy Springs-Roswell, GA             |36940 |
|Houston-The Woodlands-Sugar Land, TX          |35165 |
|Phoenix-Mesa-Scottsdale, AZ                   |33490 |
|Boston-Cambridge-Newton, MA-NH                |32895 |
|Riverside-San Bernardino-Ontario, CA          |32259 |
|San Francisco-Oakland-Hayward, CA             |31036 |
|Seattle-Tacoma-Bellevue, WA                   |

# Feature Engineering
Once you've familiarized yourself with the data, build out the features you find promising to train your model on. To work with the full dataset, you can follow these steps.
- Write a script to extract the necessary features from the smaller subset of data
- Ensure that your script is scalable, using the best practices discussed in Lesson 3
- Try your script on the full data set, debugging your script if necessary

If you are working in the classroom workspace, you can just extract features based on the small subset of data contained here. Be sure to transfer over this work to the larger dataset when you work on your Spark cluster.
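As a small illustration of the first step, the sketch below derives per-user count features from raw events in plain Python, without Spark, which can be handy for checking a Spark script against a handful of rows. The event layout, the `PAGE_FEATURES` mapping, and the `user_features` helper are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Assumed page-to-feature mapping; extend it with any other pages of interest.
PAGE_FEATURES = {
    "NextSong": "n_songs_played",
    "Thumbs Up": "n_likes",
    "Thumbs Down": "n_dislikes",
    "Roll Advert": "n_advertises",
}

def user_features(events):
    """Build one feature dict per userId from an iterable of event dicts."""
    page_counts = defaultdict(Counter)
    sessions = defaultdict(set)
    for event in events:
        user = event["userId"]
        page_counts[user][event["page"]] += 1
        sessions[user].add(event["sessionId"])

    features = {}
    for user, counts in page_counts.items():
        # A Counter returns 0 for pages the user never visited.
        row = {name: counts[page] for page, name in PAGE_FEATURES.items()}
        row["n_sessions"] = len(sessions[user])
        row["n_all_interactions"] = sum(counts.values())
        features[user] = row
    return features

# Hypothetical events, not taken from the dataset.
events = [
    {"userId": "A", "sessionId": 1, "page": "NextSong"},
    {"userId": "A", "sessionId": 1, "page": "Thumbs Up"},
    {"userId": "A", "sessionId": 2, "page": "NextSong"},
    {"userId": "B", "sessionId": 7, "page": "Roll Advert"},
]

features = user_features(events)
print(features["A"])
```

Keeping the page-to-feature mapping in one dictionary means a new feature only needs one new entry, which is the same idea as generating the aggregation expressions from a list in Spark.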

In [225]:
df.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [165]:
df.filter('page = "Cancellation Confirmation"').show(50)

+------+---------+---------+------+-------------+----------+------+-----+--------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+
|artist|     auth|firstName|gender|itemInSession|  lastName|length|level|            location|method|                page| registration|sessionId|song|status|           ts|           userAgent|userId|
+------+---------+---------+------+-------------+----------+------+-----+--------------------+------+--------------------+-------------+---------+----+------+-------------+--------------------+------+
|  null|Cancelled|   Olivia|     F|           40|      Carr|  null| free|      Fort Wayne, IN|   GET|Cancellation Conf...|1536758439000|      490|null|   200|1538400616000|Mozilla/5.0 (Wind...|   208|
|  null|Cancelled|  Lillian|     F|          234|   Cameron|  null| paid|        Columbus, OH|   GET|Cancellation Conf...|1533472700000|      471|null|   200|1538482793000|Mozilla/5.0 (Wind...|   

In [338]:
ts_events = df.select([smin('ts').alias('min_ts'), smax('ts').alias('max_ts')]).collect()[0]
min_ts = ts_events[0]
max_ts = ts_events[1]
print(f'Min. ts: {min_ts}')
print(f'Max. ts: {max_ts}')
 
w_session = Window.partitionBy('sessionId').orderBy('ts')
w_user_session = Window.partitionBy('sessionId', 'userId').orderBy('ts').rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
w_user = Window.partitionBy('userId').orderBy('ts').rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

Min. ts: 1538352011000
Max. ts: 1543622466000


In [273]:
CHURN_CANCELLATION_PAGE = 'Cancellation Confirmation'
REGISTRATION_PAGE = 'Submit Registration'

In [284]:
df = df.withColumn('previous_page', lag(df.page).over(w_session))
df = df.withColumn('last_event_ts', last(col('ts')).over(w_user))
df = df.withColumn('last_page', last(col('page')).over(w_user))
df = df.withColumn('cancellation_ts', when(df.last_page == CHURN_CANCELLATION_PAGE, df.last_event_ts).otherwise(max_ts))
df = df.withColumn('register_page', first(col('previous_page')).over(w_user))
df = df.withColumn('first_ts', first(col('ts')).over(w_user))
df = df.withColumn('registration_ts', when(df.register_page == REGISTRATION_PAGE, df.first_ts).otherwise(min_ts))

In [345]:
df = df.withColumn('ts_elapsed', last(df.ts).over(w_session) - first(df.ts).over(w_user_session))
df = df.withColumn('session_duration', smax(df.ts_elapsed).over(w_user_session))

In [458]:
# sessions from the user
df_sessions = df.where(df.userId == 100010).groupBy('sessionId', 'userId').agg(
    smin(df.ts).alias('s_ts'),  # session start: first event of the session
    smax(df.ts).alias('e_ts')   # session end: last event of the session
) 

w_user_sessions_interval = Window.partitionBy('userId').orderBy('s_ts')

# Calculate the idle interval since the previous session (start of this session minus end of the previous one)
df_sessions = df_sessions.withColumn('interval_to_session', col('s_ts') - lag(col('e_ts')).over(w_user_sessions_interval))

df_sessions.show()

+---------+------+-------------+-------------+-------------------+
|sessionId|userId|         s_ts|         e_ts|interval_to_session|
+---------+------+-------------+-------------+-------------------+
|       62|100010|1538991392000|1539004746000|               null|
|      166|100010|1539242427000|1539254318000|          237681000|
+---------+------+-------------+-------------+-------------------+



In [466]:
milliseconds_to_hours = 3600 * 1000

# The first session of each user has no previous session, so its interval is null; drop those rows before averaging
df_sessions.where(df_sessions.interval_to_session.isNotNull()).groupBy('userId').agg(
        (avg(df_sessions.interval_to_session) / milliseconds_to_hours).alias('avg_interval_between_sessions')
    ).show()

+------+-----------------------------+
|userId|avg_interval_between_sessions|
+------+-----------------------------+
|100010|                      66.0225|
+------+-----------------------------+



In [468]:
_columns = [
    # 'userId', 
    'level',
    'sessionId', 
    'itemInSession', 
    'ts_elapsed',
    'session_duration',
    'registration_ts', 
    'page',
    'previous_page',
    'register_page', 
    'ts', 
    'first_ts', 
    'last_event_ts', 
    'last_page', 
    'cancellation_ts'
]

df.select(_columns).where(df.userId == 100010).orderBy(df.ts).show(350, False)

+-----+---------+-------------+----------+----------------+---------------+-------------------------+---------------+-------------+-------------+-------------+-------------+-------------------------+---------------+
|level|sessionId|itemInSession|ts_elapsed|session_duration|registration_ts|page                     |previous_page  |register_page|ts           |first_ts     |last_event_ts|last_page                |cancellation_ts|
+-----+---------+-------------+----------+----------------+---------------+-------------------------+---------------+-------------+-------------+-------------+-------------+-------------------------+---------------+
|free |62       |0            |0         |13354000        |1538352011000  |NextSong                 |Home           |Home         |1538991392000|1538991392000|1539254318000|Cancellation Confirmation|1539254318000  |
|free |62       |1            |226000    |13354000        |1538352011000  |NextSong                 |NextSong       |Home         |15389

- Number of advertisements (per session and overall)
    - User **100010** returned after some idle time and received a considerable number of advertisements.
    - Also, after a thumbs down, the user received two advertisements within four songs, then canceled the service.
- Number of sessions

In [279]:
df.filter('userId = 100010').groupBy('sessionId').count().orderBy('count', ascending = False).show(120)

+---------+-----+
|sessionId|count|
+---------+-----+
|       62|   69|
|      166|   68|
+---------+-----+



In [354]:
TRUE = 1
FALSE = 0

milliseconds_to_hours = 3600 * 1000

df_user_profile = df.filter('userId = 100010').groupby('userId')\
    .agg( 

        first(when(col('gender') == 'M', TRUE).otherwise(FALSE)).alias('male'),

        smin(col('first_ts')).alias('ts_start'),
        smax(col('last_event_ts')).alias('ts_end'),

        # Subscription
        count(when(col('page') == 'Submit Downgrade', True)).alias("n_downgrades"),
        count(when(col('page') == 'Submit Upgrade', True)).alias("n_upgrades"),
        last(when(col('level') == 'paid', TRUE).otherwise(FALSE)).alias('paid_user'),
    
        # Streaming
        count(when(col('page') == 'NextSong', True)).alias("n_songs_played"),
        count(when(col('page') == 'Thumbs Up', True)).alias("n_likes"),
        count(when(col('page') == 'Thumbs Down', True)).alias("n_dislikes"),
        countDistinct(col('sessionId')).alias("n_sessions"),
        (avg(col('session_duration')) / milliseconds_to_hours).alias("avg_session_duration"),
    
        # Community
        count(when(col('page') == 'Add Friend', True)).alias("n_friends"),
        count(when(col('page') == 'Add to Playlist', True)).alias("n_playlist_added"),
    
        # Other
        count(when(col('page') == 'Home', True)).alias("n_home"),
        count(when(col('page') == 'Roll Advert', True)).alias("n_advertises"),
        count(when(col('page') == 'Help', True)).alias("n_help"),
        count(when(col('page') == 'Error', True)).alias("n_errors"),
        count(when(col('page') == 'Settings', True)).alias("n_settings"),
        count(col('page')).alias("n_all_interactions")
    )

df_user_profile.show()

+------+----+-------------+-------------+------------+----------+---------+--------------+-------+----------+----------+--------------------+---------+----------------+------+------------+------+--------+----------+------------------+
|userId|male|     ts_start|       ts_end|n_downgrades|n_upgrades|paid_user|n_songs_played|n_likes|n_dislikes|n_sessions|avg_session_duration|n_friends|n_playlist_added|n_home|n_advertises|n_help|n_errors|n_settings|n_all_interactions|
+------+----+-------------+-------------+------------+----------+---------+--------------+-------+----------+----------+--------------------+---------+----------------+------+------------+------+--------+----------+------------------+
|100010|   0|1538991392000|1539254318000|           0|         0|        0|            96|      4|         3|         2|   3.507733171127332|        3|               1|     3|          22|     0|       0|         0|               137|
+------+----+-------------+-------------+------------+------

# Modeling
Split the full dataset into train, test, and validation sets. Test out several of the machine learning methods you learned. Evaluate the accuracy of the various models, tuning parameters as necessary. Determine your winning model based on test accuracy and report results on the validation set. Since the churned users are a fairly small subset, I suggest using F1 score as the metric to optimize.
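To see why F1 is suggested over accuracy for a small churned class, the following plain-Python sketch computes F1 from true and predicted labels. The `f1_score` helper and the label lists are made up for illustration; a real run would use the evaluator of whichever ML library is chosen.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 churned user out of 10.
y_true = [0] * 9 + [1]
always_stay = [0] * 10  # a model that never predicts churn

accuracy = sum(t == p for t, p in zip(y_true, always_stay)) / len(y_true)
print(accuracy)                       # 0.9
print(f1_score(y_true, always_stay))  # 0.0
```

A model that never predicts churn reaches 90% accuracy on this toy sample while its F1 score is 0, which is the failure mode the F1 metric is meant to expose.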

# Final Steps
Clean up your code, adding comments and renaming variables to make the code easier to read and maintain. Refer to the Spark Project Overview page and Data Scientist Capstone Project Rubric to make sure you are including all components of the capstone project and meet all expectations. Remember, this includes thorough documentation in a README file in a Github repository, as well as a web app or blog post.