# Data Modeling - Project 1B
# Data Modeling with Apache Cassandra

## 1. Introduction
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app.

They'd like a data engineer to create an Apache Cassandra database that can run queries on song play data to answer these questions, and they have brought me onto the project. My role is to create the database for this analysis. I can test it by running the queries given to me by the Sparkify analytics team and checking the results.

### 1.1 Project Overview
In this project, I'll apply what I've learned about data modeling with Apache Cassandra and complete an ETL pipeline using Python. To complete the project, I need to model my data by creating tables in Apache Cassandra that the queries can run against. I am provided with the part of the ETL pipeline that merges a directory of CSV files into a single streamlined CSV file, which I then use to model and insert data into the Apache Cassandra tables.

## 2. ETL Pipeline for Pre-Processing the Files

### 2.1 Import Python packages 

In [1]:
# Import Python packages 
import pandas as pd
import cassandra
import re
import os
import glob
import numpy as np
import json
import csv

### 2.2 Creating list of filepaths to process original event csv data files

In [2]:
# checking the current working directory
print('current working directory: ' + os.getcwd())

# Get the current folder and subfolder event data
filepath = os.getcwd() + '/event_data'

# create a list of files and collect each filepath
for root, dirs, files in os.walk(filepath):
    # join the file path and roots with the subdirectories using glob
    file_path_list = glob.glob(os.path.join(root,'*'))
    print('%s files found in %s' % (len(file_path_list), filepath))
    # DEBUG: print list of files
    # for item in file_path_list:
    #     print(item)

current working directory: /home/workspace
30 files found in /home/workspace/event_data
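
The loop above reassigns `file_path_list` on every iteration of `os.walk`, which works as long as `event_data` contains no subfolders. The following is a minimal sketch of a variant that also collects files from nested folders, assuming the event files end in `.csv`. The helper name `collect_event_files` is hypothetical and not part of the provided pipeline.

```python
import glob
import os


def collect_event_files(base_dir):
    """Return a sorted list of every .csv file below base_dir (hypothetical helper)."""
    pattern = os.path.join(base_dir, '**', '*.csv')
    # recursive=True lets '**' match base_dir itself and any nested folders
    return sorted(glob.glob(pattern, recursive=True))
```

Calling `collect_event_files(os.getcwd() + '/event_data')` would be one way to build `file_path_list` in a single step.
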


### 2.3 Processing the files to create the data file csv that will be used for Apache Cassandra tables

In [3]:
# initiating an empty list of rows that will be generated from each file
full_data_rows_list = [] 
    
# for every filepath in the file path list 
for f in file_path_list:

    # reading csv file 
    with open(f, 'r', encoding = 'utf8', newline='') as csvfile: 
        # creating a csv reader object 
        csvreader = csv.reader(csvfile)
        # skip column names
        next(csvreader)
        
        # extracting each data row one by one and append it        
        for line in csvreader:
            # DEBUG: print every line of raw data
            # print(line)
            full_data_rows_list.append(line)
        print('Data added to full_data_rows_list from: %s' % f)
            
# DEBUG: get total number of rows
# print(len(full_data_rows_list))
# DEBUG: see what the list of event data rows will look like
#for number, row in enumerate(full_data_rows_list):
#    if row[0] != '':
#        print(number, row)

# creating a smaller event data csv file called event_datafile_new.csv that will be used to insert data into the
# Apache Cassandra tables
csv.register_dialect('myDialect', quoting=csv.QUOTE_ALL, skipinitialspace=True)

with open('event_datafile_new.csv', 'w', encoding = 'utf8', newline='') as f:
    writer = csv.writer(f, dialect='myDialect')
    writer.writerow(['artist','firstName','gender','itemInSession','lastName','length',\
                'level','location','sessionId','song','userId'])
    for row in full_data_rows_list:
        if (row[0] == ''):
            continue
        writer.writerow((row[0], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[12], row[13], row[16]))

Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-30-events.csv
Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-23-events.csv
Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-22-events.csv
Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-29-events.csv
Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-11-events.csv
Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-14-events.csv
Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-20-events.csv
Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-15-events.csv
Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-05-events.csv
Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-28-events.csv
Data added to full_data_rows_list from: /home/workspace/event_data/2018-11-25-events.csv
Data added to full_da

In [4]:
# check the number of rows in your csv file
with open('event_datafile_new.csv', 'r', encoding = 'utf8') as f:
    print('Number of rows in new csv file: %s' % sum(1 for line in f))

Number of rows in new csv file: 6821


In [5]:
# show head of datafile
df = pd.read_csv('event_datafile_new.csv')
df.head()

Unnamed: 0,artist,firstName,gender,itemInSession,lastName,length,level,location,sessionId,song,userId
0,Stephen Lynch,Jayden,M,0,Bell,182.85669,free,"Dallas-Fort Worth-Arlington, TX",829,Jim Henson's Dead,91
1,Manowar,Jacob,M,0,Klein,247.562,paid,"Tampa-St. Petersburg-Clearwater, FL",1049,Shell Shock,73
2,Morcheeba,Jacob,M,1,Klein,257.41016,paid,"Tampa-St. Petersburg-Clearwater, FL",1049,Women Lose Weight (Feat: Slick Rick),73
3,Maroon 5,Jacob,M,2,Klein,231.23546,paid,"Tampa-St. Petersburg-Clearwater, FL",1049,Won't Go Home Without You,73
4,Train,Jacob,M,3,Klein,216.76363,paid,"Tampa-St. Petersburg-Clearwater, FL",1049,Hey_ Soul Sister,73


## 3. Build up an Apache Cassandra database and do some queries

### 3.1 Creating a Cluster

In [6]:
# Create a connection to a Cassandra instance on your local machine 
# (127.0.0.1)

from cassandra.cluster import Cluster
cluster = Cluster()

# Create a session to establish connection and begin executing queries
session = cluster.connect()

### 3.2 Create Keyspace

In [7]:
try:
    session.execute("""
    CREATE KEYSPACE IF NOT EXISTS music_app 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

except Exception as e:
    print(e) 

### 3.3 Set Keyspace

In [8]:
try:
    session.set_keyspace('music_app')
except Exception as e:
    print(e)

### 3.4 Create tables in database for the given queries
Using a NoSQL database like Apache Cassandra means that you cannot JOIN between tables, so denormalization is a must. One table per query is a common strategy, which means the table design has to start from the queries.

The following queries are commonly used in our business case:

1. Give me the **artist**, **song title** and **song's length** in the music app history that was heard during **sessionId = 338** and **itemInSession = 4**
2. Give me only the following: **name of artist**, **song (sorted by itemInSession)** and **user (first and last name)** for **userid = 10, sessionid = 182**
3. Give me every **user name (first and last)** in my music app history who listened to the **song 'All Hands Against His Own'**

In [9]:
# function to create a table in Apache Cassandra
def create_table(session, table_name, columns_with_datatype, primary_key):
    query = f"CREATE TABLE IF NOT EXISTS {table_name} ({columns_with_datatype}, PRIMARY KEY ({primary_key}));"
    print(query)
    try:
        session.execute(query)
        print('Table <%s> created successfully' % table_name)
    except Exception as e:
        print(e)

In [10]:
# function to import data in a table in Apache Cassandra
def import_data(session, table_name, file, columns):
    # keep the last query so it can be shown if an insert fails
    query = None
    try:
        # read csv file with the data in pandas dataframe
        file_data = pd.read_csv(file, usecols=columns, encoding='utf8')
        print('<%s> opened successfully' % file)
        print('Insert data ...')
        # import data in database line by line
        for index, line in file_data.iterrows():
            vals = []
            for col in columns:
                val = line[col]
                # remove single quotes from strings so they cannot break the CQL literal
                if isinstance(val, str):
                    val = val.replace("'", "")
                vals.append(val)
            query = f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES {tuple(vals)};"
            # insert data from the csv into the table
            session.execute(query)
        print('Data inserted successfully')
    except Exception as e:
        print(e)
        # the query is only available if the failure happened during an insert
        if query is not None:
            print(query)
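
The function above formats the values directly into the query string, which is why single quotes are removed from the song and artist names before inserting. A possible alternative, sketched below, passes the values as parameters so the driver handles quoting and the names are stored unchanged. It assumes `session` is a `cassandra.cluster.Session`, whose `execute(query, parameters)` accepts `%s` placeholders for non-prepared statements. The helper names `build_insert_query`, `to_native` and `insert_rows` are hypothetical.

```python
def build_insert_query(table_name, columns):
    """Build an INSERT statement with one %s placeholder per column."""
    placeholders = ', '.join(['%s'] * len(columns))
    return f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({placeholders});"


def to_native(value):
    """Convert NumPy scalars (as pandas may return them) to plain Python values."""
    return value.item() if hasattr(value, 'item') else value


def insert_rows(session, table_name, columns, rows):
    """Insert every row by passing its values as query parameters."""
    query = build_insert_query(table_name, columns)
    for row in rows:
        session.execute(query, [to_native(v) for v in row])
```

With a dataframe read as in `import_data`, the rows could be supplied as `file_data[columns].itertuples(index=False)`.
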

In [11]:
def query_data(session, table_name, select_parameters, where_string):
    query = f"SELECT {', '.join(select_parameters)} FROM {table_name} WHERE {where_string}"
    # show query
    print(query)
    try:
        rows = session.execute(query)
    except Exception as e:
        print(e)
        # return an empty dataframe if the query failed
        return pd.DataFrame()
    # show dataframe with solution of query 
    df = pd.DataFrame(list(rows))
    return df

#### 3.4.1 Query 1: *Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338 and itemInSession = 4*
Create a table that contains only the columns named in the query. The columns in the WHERE clause form the primary key: sessionId is the partition key and itemInSession is the clustering column. Together they identify each row uniquely in the given dataset.
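
The uniqueness of a candidate primary key can be checked against the CSV file before creating the table. The following is a minimal sketch using only the standard library; the helper name `find_duplicate_keys` is hypothetical.

```python
import csv
from collections import Counter


def find_duplicate_keys(csv_path, key_columns):
    """Return the key tuples that occur more than once in the CSV file."""
    with open(csv_path, 'r', encoding='utf8', newline='') as f:
        counts = Counter(
            tuple(row[col] for col in key_columns) for row in csv.DictReader(f)
        )
    return [key for key, n in counts.items() if n > 1]
```

An empty list returned by `find_duplicate_keys('event_datafile_new.csv', ['sessionId', 'itemInSession'])` would indicate that no two rows share the same key.
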
##### 3.4.1.1 Create table

In [12]:
# define table_name, columns with datatypes and primary key
table_name = 'songInfo'
columns_with_datatype = 'sessionId int, itemInSession int, artist text, song text, length float'
primary_key = 'sessionId, itemInSession'

In [13]:
# create table with the predefined parameters
create_table(session, table_name, columns_with_datatype, primary_key)

CREATE TABLE IF NOT EXISTS songInfo (sessionId int, itemInSession int, artist text, song text, length float, PRIMARY KEY (sessionId, itemInSession));
Table <songInfo> created successfully


##### 3.4.1.2 Import parameters

In [14]:
# define import parameters
file = 'event_datafile_new.csv'
columns = ['sessionId', 'itemInSession', 'artist', 'song', 'length']

In [15]:
# import data with the predefined parameters
import_data(session, table_name, file, columns)

<event_datafile_new.csv> opened successfully
Insert data ...
Data inserted successfully


##### 3.4.1.3 Make query

In [16]:
# define query parameters
select_parameters = ['artist', 'song', 'length']
where_string = 'sessionId = 338 AND itemInSession = 4'

In [17]:
# make query with the predefined parameters
query_data(session, table_name, select_parameters, where_string)

SELECT artist, song, length FROM songInfo WHERE sessionId = 338 AND itemInSession = 4


Unnamed: 0,artist,song,length
0,Faithless,Music Matters (Mark Knight Dub),495.307312


#### 3.4.2 Query 2: *Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182*
Create a table that contains only the columns named in the query. The WHERE clause columns userId and sessionId form the composite partition key, and itemInSession is added as a clustering column so that the songs within a session are sorted. Together the three columns identify each row uniquely in the given dataset.
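
The effect of this key layout can be illustrated without a database. The sketch below is only a pure-Python model, not how Cassandra is implemented: rows are grouped by the partition key columns and sorted within each group by the clustering column. The function name `group_by_partition` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def group_by_partition(rows, partition_cols, clustering_col):
    """Group rows by partition key and sort each group by the clustering column."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in partition_cols)
        partitions[key].append(row)
    for key in partitions:
        partitions[key].sort(key=lambda r: r[clustering_col])
    return dict(partitions)


# made-up sample rows, deliberately out of order
sample_rows = [
    {'userId': 1, 'sessionId': 7, 'itemInSession': 2, 'song': 'third'},
    {'userId': 1, 'sessionId': 7, 'itemInSession': 0, 'song': 'first'},
    {'userId': 2, 'sessionId': 9, 'itemInSession': 0, 'song': 'other'},
    {'userId': 1, 'sessionId': 7, 'itemInSession': 1, 'song': 'second'},
]

partitions = group_by_partition(sample_rows, ['userId', 'sessionId'], 'itemInSession')
for row in partitions[(1, 7)]:
    print(row['itemInSession'], row['song'])
```

A query that fixes both userId and sessionId reads a single group, and the rows come back in itemInSession order.
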
##### 3.4.2.1 Create table

In [18]:
# define table_name, columns with datatypes and primary key
table_name = 'sessionInfo'
columns_with_datatype = 'userId int, sessionId int, itemInSession int, artist text, song text, firstName text, lastName text'
primary_key = '(userId, sessionId), itemInSession'

In [19]:
# create table with the predefined parameters
create_table(session, table_name, columns_with_datatype, primary_key)

CREATE TABLE IF NOT EXISTS sessionInfo (userId int, sessionId int, itemInSession int, artist text, song text, firstName text, lastName text, PRIMARY KEY ((userId, sessionId), itemInSession));
Table <sessionInfo> created successfully


##### 3.4.2.2 Import parameters

In [20]:
# define import parameters
file = 'event_datafile_new.csv'
columns = ['userId', 'sessionId', 'itemInSession', 'artist', 'song', 'firstName', 'lastName']

In [21]:
# import data with the predefined parameters
import_data(session, table_name, file, columns)

<event_datafile_new.csv> opened successfully
Insert data ...
Data inserted successfully


##### 3.4.2.3 Make query

In [22]:
# define query parameters
select_string = ['artist', 'song', 'firstName', 'lastName']
where_string = 'userId = 10 AND sessionId = 182'

In [23]:
# make query with the predefined parameters
query_data(session, table_name, select_string, where_string)

SELECT artist, song, firstName, lastName FROM sessionInfo WHERE userId = 10 AND sessionId = 182


Unnamed: 0,artist,song,firstname,lastname
0,Down To The Bone,Keep On Keepin On,Sylvie,Cruz
1,Three Drives,Greece 2000,Sylvie,Cruz
2,Sebastien Tellier,Kilometer,Sylvie,Cruz
3,Lonnie Gordon,Catch You Baby (Steve Pitron & Max Sanna Radio...,Sylvie,Cruz


#### 3.4.3 Query 3: *Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'*
Create a table with the columns named in the query plus userId, which is needed to make the primary key unique. song is the partition key because it is the look-up column, and userId is the clustering column because it is the best parameter to separate users.
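
An INSERT in Cassandra overwrites an existing row with the same primary key, so a user who played the same song several times ends up as a single row in this table. The sketch below models that behavior with a plain dictionary; it is an illustration only, and the function name `upsert_rows` and the sample rows are made up.

```python
def upsert_rows(rows, key_cols):
    """Model a table as a dict in which a later row replaces an earlier row with the same key."""
    table = {}
    for row in rows:
        table[tuple(row[c] for c in key_cols)] = row
    return table


# made-up sample rows: user 1 plays the same song twice
sample_plays = [
    {'song': 'Song A', 'userId': 1, 'firstName': 'Ann', 'lastName': 'Lee'},
    {'song': 'Song A', 'userId': 2, 'firstName': 'Bob', 'lastName': 'Ray'},
    {'song': 'Song A', 'userId': 1, 'firstName': 'Ann', 'lastName': 'Lee'},
]

table = upsert_rows(sample_plays, ['song', 'userId'])
print(len(table))
```

With (song, userId) as the key, each listener of a song appears once, which matches what the query asks for.
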
##### 3.4.3.1 Create table

In [24]:
# define table_name, columns with datatypes and primary key
table_name = 'userInfo'
columns_with_datatype = 'song text, userId int, firstName text, lastName text'
primary_key = 'song, userId'

In [25]:
# create table with the predefined parameters
create_table(session, table_name, columns_with_datatype, primary_key)

CREATE TABLE IF NOT EXISTS userInfo (song text, userId int, firstName text, lastName text, PRIMARY KEY (song, userId));
Table <userInfo> created successfully


##### 3.4.3.2 Import parameters

In [26]:
# define import parameters
file = 'event_datafile_new.csv'
columns = ['song', 'userId', 'firstName', 'lastName']

In [27]:
# import data with the predefined parameters
import_data(session, table_name, file, columns)

<event_datafile_new.csv> opened successfully
Insert data ...
Data inserted successfully


##### 3.4.3.3 Make query

In [28]:
# define query parameters
select_string = ['song', 'firstName', 'lastName']
where_string = "song = 'All Hands Against His Own'"

In [29]:
# make query with the predefined parameters
query_data(session, table_name, select_string, where_string)

SELECT song, firstName, lastName FROM userInfo WHERE song = 'All Hands Against His Own'


Unnamed: 0,song,firstname,lastname
0,All Hands Against His Own,Jacqueline,Lynch
1,All Hands Against His Own,Tegan,Levine
2,All Hands Against His Own,Sara,Johnson


### Drop the tables before closing the session

In [30]:
# list of used tables
tables = ['songInfo', 'sessionInfo', 'userInfo']

# drop all tables in defined list
for table in tables:
    query = f"DROP TABLE IF EXISTS {table}"
    try:
        rows = session.execute(query)
        print(f"{table} dropped")
    except Exception as e:
        print(e)

songInfo dropped
sessionInfo dropped
userInfo dropped


### Close the session and cluster connection

In [31]:
session.shutdown()
cluster.shutdown()