<a href="https://colab.research.google.com/github/ajinkyabhanudas/SIADS696/blob/dev/Pre_Data_cleaning_EDA_Ajinkya.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#This notebook captures thoughts on a probable choice of features based on visual inspection and prior knowledge

##1
Reasons for direct elimination of the following properties:
- channelId, channelTitle: these could be a source of data leakage that leads to overfitting.
- thumbnails: we're not working with image data and will hence disregard these values.
- defaultAudioLanguage: we've restricted the data to English only.

Eliminated after visual inspection:
- dimension, licensedContent: we have no working theory on how to leverage these properties (for example, a dimension of '2d') as features in our model, and we want to avoid spurious correlations.
- liveBroadcastContent: we're looking for a view count independent of this state, because considering this property would lead to an imbalance given the live videos available.
- localized: appears to be a dictionary that repeats details such as the description.

Need further review:
- definition: we'd have to inspect whether a video lacking the "hd" option has a strong impact on its views. Checking the correlation might be a good place to start.
- tags: might not be useful; if they contain markers that indicate the channel name, they are likely to throw a model off.
- contentRating: because it is only sparsely available, this feature might not add significant value.
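
The elimination lists above can be applied to a single video record with a small filter. This is a minimal sketch, not the project's actual preprocessing: `DROPPED_KEYS` and `drop_eliminated` are hypothetical names, and the sample record uses made-up values. Only the key names come from the video record inspected later in this notebook.

```python
# Hypothetical helper: key names match the video record in this notebook,
# but the constant and function names are illustrative assumptions.
DROPPED_KEYS = {
    # direct elimination
    "channelId", "channelTitle", "thumbnails", "defaultAudioLanguage",
    # eliminated after visual inspection
    "dimension", "licensedContent", "liveBroadcastContent", "localized",
}


def drop_eliminated(video):
    """Return a copy of one video record without the eliminated properties."""
    return {key: value for key, value in video.items() if key not in DROPPED_KEYS}


# Made-up record for illustration only.
sample_video = {
    "title": "Example title",
    "channelId": "UC_example",
    "dimension": "2d",
    "duration": "PT4M13S",
    "viewCount": "1000",
}
kept = drop_eliminated(sample_video)
print(sorted(kept))
```

Returning a new dictionary leaves the loaded JSON untouched, so the raw record stays available for later inspection.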


##2
Finally, the features that we plan on extracting can be used to work up 2 different approaches:
- A two-part view-count prediction task based on the following features:
  - title, topicCategories, duration: for first-time content creators
  - title, topicCategories, duration, categoryId, historical aggregates of (subscriber, view, like, favorite, comment count): for users who have created content in the past. An alternative approach could be to use AR(I)MA(X) models with channel-specific historical video stats and exogenous variables like publishedAt.

- tags, definition, and contentRating need to be evaluated further to look for possible correlations.
- Since users, in most cases, don't interact with descriptions before watching a video, the description might be worth further evaluation too.
- The total viewCount and videoCount features might need further evaluation of their utility in aggregate calculations for regression-based models.
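
The "historical aggregates" mentioned above could be computed along these lines. This is a sketch under assumptions: the mean is used as the aggregate (the notes do not fix a choice), `historical_aggregates` is a hypothetical name, per-video counts are assumed to arrive as numeric strings, and the sample values are made up. The channel-level subscriber count is not a per-video field, so it is left out here.

```python
from statistics import mean

# Per-video count fields, taken from the keys of a video record.
COUNT_KEYS = ("viewCount", "likeCount", "favoriteCount", "commentCount")


def historical_aggregates(past_videos):
    """Mean of each count over a channel's earlier videos.

    Videos that lack a given count are skipped for that count;
    the aggregate is None when no video provides it.
    """
    aggregates = {}
    for key in COUNT_KEYS:
        values = [int(video[key]) for video in past_videos if video.get(key) is not None]
        aggregates[f"mean_{key}"] = mean(values) if values else None
    return aggregates


# Made-up history for illustration only; the second video has no commentCount.
past = [
    {"viewCount": "100", "likeCount": "10", "favoriteCount": "0", "commentCount": "4"},
    {"viewCount": "300", "likeCount": "30", "favoriteCount": "0"},
]
agg = historical_aggregates(past)
print(agg)
```

For a first-time creator `past_videos` is empty and every aggregate is `None`, which matches the split into two feature sets described above.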






In [37]:
!git clone https://github.com/ajinkyabhanudas/SIADS696.git

Cloning into 'SIADS696'...
remote: Enumerating objects: 263, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 263 (delta 8), reused 2 (delta 2), pack-reused 221
Receiving objects: 100% (263/263), 13.92 MiB | 15.52 MiB/s, done.
Resolving deltas: 100% (132/132), done.


In [38]:
import glob
import json
import os
import pandas as pd

In [39]:
try:
  os.chdir('SIADS696')
except OSError:
  # Raised when the directory is missing or inaccessible, e.g. on a re-run from inside it.
  print("You're either already in the SIADS696 directory, or the path specified isn't accessible")

In [40]:
rootdir = 'data'
# Load only the first file found; a single channel is enough for visual inspection.
for path in glob.glob(f'./{rootdir}/*/*'):
    with open(path, "r") as read_file:
        data = json.load(read_file)
    break


In [41]:
channel_id = list(data.keys())[0]

In [42]:
data[channel_id]["channel_statistics"]

{'viewCount': '13637216111',
 'subscriberCount': '55900000',
 'hiddenSubscriberCount': False,
 'videoCount': '403'}
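
The counts in the output above are strings, so they need converting before any arithmetic. A minimal sketch, assuming every count is a non-negative integer string as in this sample; `parse_channel_statistics` is a hypothetical helper name.

```python
def parse_channel_statistics(stats):
    """Convert digit-only string values to int; leave other values unchanged."""
    parsed = {}
    for key, value in stats.items():
        if isinstance(value, str) and value.isdigit():
            parsed[key] = int(value)
        else:
            parsed[key] = value
    return parsed


# Values copied from the output above.
stats = {
    'viewCount': '13637216111',
    'subscriberCount': '55900000',
    'hiddenSubscriberCount': False,
    'videoCount': '403',
}
parsed = parse_channel_statistics(stats)
print(parsed)
```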

In [43]:
video_key = list(data[channel_id]["video_data"].keys())[0]
data[channel_id]["video_data"][video_key].keys()

dict_keys(['publishedAt', 'title', 'channelId', 'description', 'thumbnails', 'channelTitle', 'tags', 'categoryId', 'liveBroadcastContent', 'localized', 'defaultAudioLanguage', 'viewCount', 'likeCount', 'favoriteCount', 'commentCount', 'duration', 'dimension', 'definition', 'caption', 'licensedContent', 'contentRating', 'projection', 'topicCategories'])
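
Some of the keys listed above need conversion before they can serve as numeric features. For `duration`, the sketch below assumes the value is an ISO 8601 duration string such as `PT4M13S` (the format the YouTube Data API documents for this field) and converts it to seconds. No duration value is shown in this notebook, so the sample strings are illustrative, and `duration_to_seconds` is a hypothetical helper name.

```python
import re

# Matches day/hour/minute/second durations such as "PT4M13S" or "P1DT2H".
# Year, month and week designators are not handled in this sketch.
_DURATION_RE = re.compile(
    r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration format: {duration!r}")
    days, hours, minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(duration_to_seconds("PT4M13S"))
```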


In [44]:
explore_cats = ["dimension", "liveBroadcastContent", "localized", "title", "description", "tags", "categoryId",  "viewCount", 
 "likeCount", "favoriteCount", "commentCount", 
 "duration", "definition", "caption",
 "licensedContent", "contentRating", "topicCategories"]

Uncomment and run the cell below to see what a particular channel's data looks like before the elimination based on visual inspection.

In [45]:
# for cat in explore_cats:
#   print(f'{cat}***: {data[channel_id]["video_data"][video_key][cat]}')

Hypothesis to check: compare results with and without the subscriber:video count ratio in the data.

In [46]:
tentative_cats = ["title", "description", "categoryId",  "viewCount", "tags",
 "likeCount", "favoriteCount", "commentCount", 
 "duration", "definition",
 "licensedContent", "contentRating", "topicCategories"]

In [47]:
# for cat in tentative_cats:
#   print(f'{cat}***: {data[channel_id]["video_data"][video_key][cat]}')

Finally, the features that we plan on extracting can be used to work up 2 different approaches:
- A two-part view-count prediction task based on the following features:
  - title, topicCategories, duration: for first-time content creators
  - title, topicCategories, duration, categoryId, historical aggregates of (view, like, favorite, comment count): for users who have created content in the past.

- tags, definition, and contentRating need to be evaluated further to look for possible correlations.
- Since users, in most cases, don't interact with descriptions before watching a video, the description might be worth further evaluation too.
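
One concern when evaluating tags is that they may carry the channel name and so leak channel identity into the model. A rough way to flag such tags is sketched below. The matching rule (case-insensitive, spaces ignored) is an assumption, `tags_leaking_channel` is a hypothetical name, and the sample tags are made up.

```python
def tags_leaking_channel(tags, channel_title):
    """Return the tags that contain the channel title, ignoring case and spaces."""
    needle = channel_title.casefold().replace(" ", "")
    if not needle:
        # An empty title would match every tag, so report nothing instead.
        return []
    return [tag for tag in tags if needle in tag.casefold().replace(" ", "")]


# Made-up tags and channel title for illustration only.
sample_tags = ["Example Channel official", "music", "examplechannel", "live"]
leaky = tags_leaking_channel(sample_tags, "Example Channel")
print(leaky)
```

The share of flagged tags per channel could then inform whether tags are kept, cleaned, or dropped.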






In [49]:
# A comma was missing after "tags", which silently concatenated it with "likeCount".
["publishedAt", "title", "description", "categoryId",  "viewCount", "tags",
 "likeCount", "favoriteCount", "commentCount", 
 "duration", "definition",
 "licensedContent", "contentRating", "topicCategories"]

['publishedAt',
 'title',
 'description',
 'categoryId',
 'viewCount',
 'tags',
 'likeCount',
 'favoriteCount',
 'commentCount',
 'duration',
 'definition',
 'licensedContent',
 'contentRating',
 'topicCategories']