# Introduction
All videos that are part of this exercise have accompanying OBS logs, so the analysis will start with those.  Recordings to `.mkv` files have the explicit filename in the log, but Twitch video ID numbers are not present.  Fortunately, the `.json` files downloaded with [`tcd`](https://pypi.org/project/tcd/) have timestamps and durations, and I should be able to match them up using the overlap of the time periods.
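
As a rough sketch of the overlap idea (not the matching code used later in this notebook), two time periods can be compared by taking the later start and the earlier end.  The function name and the sample times here are made up for illustration.

```python
from datetime import datetime, timedelta


def overlap_seconds(start_a, end_a, start_b, end_b):
    """Return how many seconds two time periods share (0.0 if they are disjoint)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max(0.0, (earliest_end - latest_start).total_seconds())


# Hypothetical OBS stream period and Twitch VOD period, both in local time.
obs_start = datetime(2019, 11, 24, 18, 18, 20)
obs_end = datetime(2019, 11, 24, 20, 38, 25)
vod_start = datetime(2019, 11, 24, 18, 18, 29)
vod_end = vod_start + timedelta(hours=2, minutes=20)

shared = overlap_seconds(obs_start, obs_end, vod_start, vod_end)
print(shared)
```

A VOD and an OBS stream that share most of their duration are very likely the same broadcast; a pair with zero overlap can be ruled out.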

# YouTube
Seeing what got uploaded to the [dma's Twitch Archive](https://www.youtube.com/channel/UCWlc332uGYwHVPSTyDwFW-g) channel might be a bit more of a pain.  Some videos were uploaded using streaming-quality videos either copied across directly from Twitch or uploaded from a local `.mp4` file.  Others were uploaded using the `.mkv` file generated directly by OBS.  YouTube keeps the name of the file used for the upload, so I should be able to use this for matching in many cases.  In other cases, I made a manual link to the video ID in the YouTube description, so I should be able to use that.
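
Here is a minimal sketch of how those two clues could be checked for a single video.  It assumes that `.mp4` files downloaded from Twitch are named with the video ID followed by a hyphen, and that the description contains a `https://www.twitch.tv/videos/<id>` link; the function name and sample file names are hypothetical.

```python
import re
from typing import Optional

# Assumption: downloaded Twitch VODs are named "<video id>-<anything>.mp4".
RE_FILE_ID = re.compile(r"^(\d+)-.*\.mp4$")
# Assumption: manually added links look like https://www.twitch.tv/videos/<video id>.
RE_VOD_LINK = re.compile(r"https://www\.twitch\.tv/videos/(\d+)")


def guess_twitch_id(file_name: str, description: str) -> Optional[int]:
    """Guess the Twitch video ID from the upload file name or the description."""
    if (m := RE_FILE_ID.match(file_name)):
        return int(m.group(1))
    if (m := RE_VOD_LINK.search(description)):
        return int(m.group(1))
    return None


print(guess_twitch_id("512920284-example.mp4", ""))
print(guess_twitch_id("2019-11-24 18-18-19.mkv",
                      "Original: https://www.twitch.tv/videos/512920284"))
```

The `.mp4` suffix in the file-name pattern matters: without it, an OBS recording such as `2019-11-24 18-18-19.mkv` would be misread as video ID 2019.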

In order to access the channel with the ability to modify things, I'll need to be authenticated properly, and the [OAuth 2.0 Playground](https://developers.google.com/oauthplayground/) site should help with that.  The [YouTube Data API v3](https://developers.google.com/youtube/v3/docs/) is used to query and modify videos, comments, playlists, and other YouTube objects.  I will need an OAuth 2.0 token supplied either through an `access_token` query parameter or an `Authorization` HTTP header.

```
GET https://www.googleapis.com/youtube/v3/search?part=snippet&forMine=true&order=date&type=video&key=[YOUR_API_KEY] HTTP/1.1
Authorization: Bearer [YOUR_ACCESS_TOKEN]
Accept: application/json
```
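
For reference, this is roughly how the same request could be built from Python with only the standard library.  It is a sketch: the helper name is made up, the token is a placeholder, and I'm assuming the bearer token is enough on its own, so the `key` parameter is left out.  Nothing is sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(access_token: str) -> Request:
    """Build (but do not send) a search request for my own uploaded videos."""
    params = {
        "part": "snippet",
        "forMine": "true",
        "order": "date",
        "type": "video",
    }
    url = "https://www.googleapis.com/youtube/v3/search?" + urlencode(params)
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json",
    }
    return Request(url, headers=headers)


# Placeholder token; a real one would come from the OAuth 2.0 Playground.
req = build_search_request("YOUR_ACCESS_TOKEN")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen()` would actually perform the call.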

Links:
* [Google APIs](https://console.developers.google.com/)
* [Google Cloud Platform](https://console.cloud.google.com/) (seems to be a superset of the API page)



# Jupyter Settings
The default Jupyter settings are a bit annoying.  Let's fix that.

In [None]:
%config IPCompleter.greedy=False

# From https://stackoverflow.com/a/34058270/7077511
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

import pandas as pd
pd.options.display.max_rows = 1000
pd.options.display.max_colwidth = 200

# Globals and Utility Code
For the sake of convenience, I'm going to put some global functions and data here.

In [None]:
import re
from datetime import timedelta

class util:
    # Path constants.
    OBS_LOGS_DIR = R'C:\Users\dma\AppData\Roaming\obs-studio\logs'
    DATA_DIR = R'C:\Users\dma\Documents\twitch-analytics\data'
    TWITCH_LOGS_DIR = R'C:\Users\dma\Documents\twitch-analytics\data\chat-logs'
    
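    # Matches either YouTube style ("PT52M3S") or Twitch style ("4h10m52s").
    _re_duration = re.compile(R'(PT)?((\d+)[Hh])?((\d+)[Mm])?((\d+)[Ss])?')

    @staticmethod
    def parse_duration(data: str):
        '''Parse a duration string into a timedelta, or return None if nothing usable is found.'''
        m = util._re_duration.fullmatch(data)

        # Every part of the pattern is optional, so also require at least one component.
        # Otherwise, any string at all would "match" and come back as a zero duration.
        if m and (m.group(3) or m.group(5) or m.group(7)):
            return timedelta(
                hours   = int(m.group(3) or 0),
                minutes = int(m.group(5) or 0),
                seconds = int(m.group(7) or 0))

        return None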


In [None]:
print(util.OBS_LOGS_DIR)

In [None]:
from dataclasses import dataclass
import re
import glob
import json
import pandas as pd

youtube_videos = []

for fn in glob.glob(R'C:\Users\dma\Documents\twitch-analytics\youtube\details*.json'):
    # print(fn)
    
    with open(fn) as json_file:
        data = json.load(json_file)
        youtube_videos.extend(data["items"])


In [None]:
# Get a list of video files we actually have on disk.
local_obs_mkv_files = glob.glob(R'G:\Library\Twitch\OBS_Video\*.mkv')
local_twitch_mp4_files = glob.glob(R'G:\Library\Twitch\Downloads\*.mp4')

# Twitch Chat Logs
The initial pass is just getting the starting time and duration of each Twitch video.  Note that I didn't start downloading these until 2019-03-25.  Anything before that is a lost cause.

In [None]:
from datetime import datetime, timezone, timedelta
from typing import List
import dateutil.parser
import pandas as pd

@dataclass
class TwitchVideo:
    id: int
    start_date: float  # POSIX timestamp in seconds (the result of .timestamp()), not a datetime.
    duration: timedelta
    title : str
    obj: object


twitch_videos : List[TwitchVideo] = []

for fn in glob.glob(R'C:\Users\dma\Documents\twitch-analytics\data\chat-logs\*.json'):
    with open(fn) as json_file:
        data = json.load(json_file)
        v = data['video']
        duration = util.parse_duration(v['duration'])
        id = int(v["id"])
        twitch_videos.append(TwitchVideo(
            id,
            dateutil.parser.isoparse(v['created_at']).timestamp(),
            duration, v['title'], v))


# OBS Log Files
OBS logs are annoying things.  Their timestamps don't have the dates, so I'll need to infer those from the file name itself.  In a couple of files, the log spans multiple days (this shows up as OBS writing out a large number of records while the computer is locked overnight).  Fortunately, I always restart OBS before actually using it, so I shouldn't need to look for evidence of the clock passing midnight.
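
As a small sketch of that inference (separate from the parser further down), the date can be pulled out of the log's file name and glued onto the time at the start of a line.  The helper name is made up, and midnight rollover is ignored for the reason given above.

```python
import re
from datetime import datetime

# OBS names its logs like "2019-11-24 18-16-57.txt"; only the date part is needed.
RE_LOG_NAME = re.compile(r"(\d{4}-\d\d-\d\d) \d\d-\d\d-\d\d\.txt$")


def line_timestamp(log_path: str, log_line: str) -> datetime:
    """Combine the date from the log's file name with the time on one log line."""
    m = RE_LOG_NAME.search(log_path)
    if m is None:
        raise ValueError(f"No date found in log file name: {log_path}")

    # Lines look like "18:18:19.647: ==== Recording Start ====...".
    time_part = log_line.split(": ", 1)[0]
    return datetime.strptime(f"{m.group(1)} {time_part}", "%Y-%m-%d %H:%M:%S.%f")


ts = line_timestamp(
    "2019-11-24 18-16-57.txt",
    "18:18:19.647: ==== Recording Start ===============================================")
print(ts.isoformat())
```

The result is a naive `datetime` in local time, since the log lines carry no time zone information.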
    
## Recording
From `2019-11-24 18-16-57.txt`:
```
18:18:19.647: ==== Recording Start ===============================================
18:18:19.647: [ffmpeg muxer: 'adv_file_output'] Writing file 'E:/OBS_Video/2019-11-24 18-18-19.mkv'...

20:38:25.104: [ffmpeg muxer: 'adv_file_output'] Output of file 'E:/OBS_Video/2019-11-24 18-18-19.mkv' stopped
20:38:25.104: Output 'adv_file_output': stopping
20:38:25.104: Output 'adv_file_output': Total frames output: 504318
20:38:25.104: Output 'adv_file_output': Total drawn frames: 504326 (504327 attempted)
20:38:25.104: Output 'adv_file_output': Number of lagged frames due to rendering lag/stalls: 1 (0.0%)
20:38:25.104: ==== Recording Stop ================================================
```

## Streaming
From `2019-11-24 18-16-57.txt`:
```
18:18:20.599: [rtmp stream: 'adv_stream'] Connection to rtmp://live-iad.twitch.tv/app successful
18:18:20.600: ==== Streaming Start ===============================================

20:38:25.049: [rtmp stream: 'adv_stream'] User stopped the stream
20:38:25.049: Output 'adv_stream': stopping
20:38:25.049: Output 'adv_stream': Total frames output: 504259
20:38:25.049: Output 'adv_stream': Total drawn frames: 504329 (504330 attempted)
20:38:25.049: Output 'adv_stream': Number of lagged frames due to rendering lag/stalls: 1 (0.0%)
20:38:25.049: ==== Streaming Stop ================================================
```

For reference, here is the relevant data from `512920284.json`, the matching Twitch log.  Note that the timestamps are in UTC and that they're lagged by several seconds.
```js
    "video": {
        "created_at": "2019-11-24T23:18:29Z",
        "description": "",
        "duration": "2h20m0s",
        "id": "512920284",
        "language": "en",
        "published_at": "2019-11-24T23:18:29Z",
        // ...
    }
```
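
To put a number on that lag, here is a quick sketch comparing the "Streaming Start" time above with `created_at`.  The UTC-5 offset for the OBS machine is my assumption, inferred from the five-hour difference between the two timestamps.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the OBS machine's clock was at UTC-5 when this log was written.
ASSUMED_LOCAL_TZ = timezone(timedelta(hours=-5))

# "Streaming Start" from the OBS log (local time, date taken from the file name).
obs_start = datetime(2019, 11, 24, 18, 18, 20, 600000, tzinfo=ASSUMED_LOCAL_TZ)

# "created_at" from the Twitch JSON (the trailing Z means UTC).
twitch_created = datetime.strptime(
    "2019-11-24T23:18:29Z", "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

lag = (twitch_created - obs_start).total_seconds()
print(f"Twitch lags OBS by {lag:.1f} seconds")
```

Because both values are timezone-aware, the subtraction handles the UTC conversion on its own.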

In [None]:
# See https://stackoverflow.com/questions/33837918/type-hints-solve-circular-dependency
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime, timezone, timedelta
from typing import List, Set, Optional, Union
import os
import re
import pytz
import warnings

# MKV files that are bad either due to running out of disk space or me screwing something up.
# The equivalent MP4 files from the Twitch VODs will need to be used instead.
MKV_BLACKLIST = [
    # Ran out of space.
    "2019-07-04 09-05-11.mkv",
    "2019-07-04 15-13-52.mkv",
    "2019-07-06 09-00-03.mkv",
    "2019-07-06 14-05-43.mkv",
    "2019-07-11 20-26-32.mkv",
    "2019-07-11 20-27-33.mkv",
    "2019-07-17 19-27-48.mkv",
    "2019-07-17 19-32-40.mkv",
    "2019-07-20 13-41-11.mkv",
    "2019-07-20 19-37-44.mkv",
    
    # Accidentally forgot to record until the very end.
    "2020-08-14 08-59-20.mkv",
    "2019-11-02 18-08-02.mkv",
    
    # Recording of some training.
    "2022-03-31 13-24-25.mkv",
    "2022-04-01 09-01-19.mkv",
    
    # These were accidentally recorded at too low of a bitrate, so the Twitch VODs are better.
    "2022-04-09 18-53-34.mkv",
    "2022-04-10 17-01-21.mkv",
    "2022-04-10 18-37-13.mkv",
    "2022-04-23 14-25-01.mkv",
    "2022-04-24 10-12-11.mkv",
    "2022-04-24 14-46-01.mkv",
    "2022-04-28 19-42-06.mkv",
    "2022-04-30 16-31-33.mkv",
    "2022-05-01 09-10-30.mkv",
    "2022-05-01 19-58-20.mkv",
    "2022-05-02 19-54-03.mkv",
    "2022-05-03 20-02-46.mkv",
    "2022-05-05 17-18-28.mkv",
    "2022-05-06 17-50-31.mkv",
    "2022-05-07 10-21-53.mkv",
    "2022-05-07 15-44-28.mkv",
    "2022-05-08 14-38-55.mkv",
    "2022-05-14 09-50-20.mkv",
    "2022-05-14 13-24-18.mkv",
    "2022-05-15 14-02-22.mkv"
]

# These are not games.
PROC_BLACKLIST = [
    "LockApp.exe", "chrome.exe", "explorer.exe", "ApplicationFrameHost.exe", 
    "unknown", "Steam.exe", "Discord.exe"]

class ObsLogEvent:
    '''Meaningful event in an OBS log file.'''

    def __init__(self, event_type: str, start_time: float):
        # Members.  Times are POSIX timestamps in seconds.
        self.event_type : str = event_type
        self.start_time : float = start_time
        self.end_time : Optional[float] = None
        
        # Output file used for a recording.
        self.mkv_file : str = None
            
        # YouTube video object (for recordings).
        self.youtube_video : object = None
        
        # ID number of Twitch video (for streams).
        self.twitch_id : int = None
        
        # Twitch video object (for streams).
        self.twitch_video : TwitchVideo = None
        
        # Related events (i.e. those with overlapping timestamps).
        self.related : List[ObsLogEvent] = []
        
    def find_youtube_video(self):  
        # Find the YouTube video if possible.
        if (self.event_type == 'Recording'):
            mkv_base = os.path.basename(self.mkv_file)

            for v in youtube_videos:
                if v["fileDetails"]["fileName"] == mkv_base:
                    self.youtube_video = v
                    break
        elif (self.event_type == 'Stream' and self.twitch_id is not None):
            for v in youtube_videos:
                if v["fileDetails"]["fileName"].startswith(str(self.twitch_id) + "-"):
                    self.youtube_video = v
                    break
            
            
    def add_related(self, other: ObsLogEvent):
        if (other is not None and other is not self):
            # Make sure duplicates aren't inserted.  Considering that this list will usually
            # only have one value, this should be efficient enough.
            for i in self.related:
                if (i is other):
                    return
            
            self.related.append(other)
        
  
class ObsLogFile:
    '''OBS log file.'''
    
    # Raw strings keep the regex escapes readable and avoid invalid-escape warnings for "\.".
    re_filename = re.compile(R"\\(\d\d\d\d-\d\d-\d\d) \d\d-\d\d-\d\d\.txt")
    re_line = re.compile(R"^(..:..:..\....): (\[ffmpeg muxer: 'adv_file_output'\] Writing file '(.*)'\.\.\.|==== ((Recording|Streaming) (Start|Stop)) ======+)$")
    re_hook = re.compile(R"^.*attempting to hook fullscreen process: (.*)")
    
    def load_data(self):
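        # Parse the filename itself to get the date.  Without it, no timestamp can be built.
        m = ObsLogFile.re_filename.search(self.filename)
        if not m:
            raise ValueError(f"Could not parse a date from log file name: {self.filename}")
        date = m.group(1)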
        
        # ...
        current_recording : ObsLogEvent = None
        current_stream : ObsLogEvent = None
        
        # Read through the file and pick out useful bits.
        with open(self.filename, "r") as f:
            line_num = 0
            for line in f:
                line_num = line_num + 1
                
                if (m := ObsLogFile.re_line.match(line)):
                    # Get the time and date.
                    # dt = datetime.strptime(date + " " + m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
                    dt = dateutil.parser.isoparse(date + "T" + m.group(1)).timestamp()

                    # Process the actual line of the file.
                    if (m.group(3)):
                        current_recording.mkv_file = m.group(3)
                        current_recording.find_youtube_video()
                                              
                    elif (m.group(4) == "Recording Start"):
                        if (current_recording):
                            raise Exception("Started a recording when one was already in progress.", self.filename, line_num)

                        current_recording = ObsLogEvent("Recording", dt)
                        
                        # If simultaneously recording and streaming, associate the two activities.
                        if current_recording and current_stream:
                            current_stream.add_related(current_recording)
                            current_recording.add_related(current_stream)
                            
                        self.events.append(current_recording)
                        
                    elif (m.group(4) == "Recording Stop"):
                        if (not current_recording):
                            raise Exception("Ended a recording when none was in progress.", self.filename, line_num)


                        current_recording.end_time = dt
                        
                        # If simultaneously recording and streaming, associate the two activities.
                        if current_recording and current_stream:
                            current_stream.add_related(current_recording)
                            current_recording.add_related(current_stream)
                            
                        current_recording = None
                    elif (m.group(4) == "Streaming Start"):
                        if (current_stream):
                            raise Exception("Started a stream when one was already in progress.", self.filename, line_num)

                        current_stream = ObsLogEvent("Stream", dt)
                        
                        # If simultaneously recording and streaming, associate the two activities.
                        if current_recording and current_stream:
                            current_stream.add_related(current_recording)
                            current_recording.add_related(current_stream)
                            
                        self.events.append(current_stream)
                    elif (m.group(4) == "Streaming Stop"):
                        if (not current_stream):
                            # I'm reducing this one to a warning because it actually happened once.
                            warnings.warn(Warning("Ended a stream when none was in progress.", self.filename, line_num))
                            continue

                        current_stream.end_time = dt

                        # See if we can find the Twitch ID for this video.  The Twitch timestamp appears
                        # to lag about 9 seconds behind mine.
                        ts_start = current_stream.start_time
                        ts_end = current_stream.end_time
                        for tv in twitch_videos:
                            if tv.start_date > ts_start and tv.start_date < ts_end:
                                if current_stream.twitch_id:
                                    raise Exception("Found multiple twitch video IDs: ", tv, current_stream)
                                    
                                current_stream.twitch_id = tv.id
                                current_stream.twitch_video = tv
                                current_stream.find_youtube_video()
                                
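                        # Only warn for streams that began after 2019-03-25 11:30:53 UTC (1553513453),
                        # around when I started downloading chat logs, and that ran longer than two minutes.
                        if (not current_stream.twitch_id and current_stream.start_time > 1553513453 and (current_stream.end_time - current_stream.start_time > 120)):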
                            warnings.warn(Warning("Could not find Twitch video ID.", self.filename, line_num))

                        # If simultaneously recording and streaming, associate the two activities.
                        if current_recording and current_stream:
                            current_stream.add_related(current_recording)
                            current_recording.add_related(current_stream)
                            
                        current_stream = None
                elif (m := ObsLogFile.re_hook.match(line)):
                    # This is a decent-enough way to determine what game I was playing for that session of OBS.
                    # I don't think I ever played multiple games in one session.
                    if (not m.group(1) in PROC_BLACKLIST):
                        self.hooked_procs.add(m.group(1))
                
    def __init__(self, filename: str):
        self.filename = filename
        self.events : List[ObsLogEvent] = []
        self.hooked_procs : Set[str] = set()
        self.load_data()

obs_logs : List[ObsLogFile] = []
    
for fn in sorted(glob.glob(R'C:\Users\dma\AppData\Roaming\obs-studio\logs\20*.txt')):
    print(fn)
    obs_logs.append(ObsLogFile(fn))




# Analysis Work
I'm going to put all analysis below this point.  Everything above is loading data.

In [None]:
# Find files that were uploaded more than once.
df = pd.DataFrame({
    'id': list(x["id"] for x in youtube_videos),
    'fileName': list(x["fileDetails"]["fileName"] for x in youtube_videos)
})

display(df.groupby(['fileName']).count().sort_values(by='id', ascending=False))

In [None]:
# Display certain YouTube videos.  Just edit this query as needed.
v = ([
    x["fileDetails"]["fileName"], 
    x["id"], 
    x["contentDetails"]["duration"],
    x["status"]["privacyStatus"],
    x["snippet"]["title"]
] for x in youtube_videos if x["fileDetails"]["fileName"] in [
    "2020-08-16 06-15-46.mkv", 
    "2020-08-09 14-00-45.mkv",
    "2020-12-12 15-10-18.mkv"])

display(pd.DataFrame(v))


In [None]:
# Display all videos that were uploaded from Twitch (as opposed to a .mp4 or mkv file).
v = ([
    x["fileDetails"]["fileName"], 
    x["id"], 
    x["contentDetails"]["duration"],
    x["status"]["privacyStatus"],
    x["snippet"]["title"]
] for x in youtube_videos if x["fileDetails"]["fileName"] in ["unknown"])

display(pd.DataFrame(v, columns=["fileName", "id", "duration", "privacyStatus", "title"]).sort_values(by="title"))


In [None]:
@dataclass
class SetTitleRecord:
    youtube_id: str
    original_url: str
    title: str
    duration: str
    created_at: str
        
TO_DO : List[dict] = []  # Plain dicts rather than SetTitleRecord, so json.dumps() works below.

for olf in obs_logs:
    for e in olf.events:
        if e.event_type=='Recording':
            twitch_ids = list(set([x.twitch_id for x in e.related if x.twitch_id]))
            
            if len(e.related) == 1 and e.related[0].twitch_video is not None:
                # For these, there's exactly one Twitch video and one YouTube video.
                yv = e.youtube_video
                tv = e.related[0].twitch_video
                
                # TO_DO.append(SetTitleRecord())
                # print("    ", e.__dict__)
                # print("    ", e.related[0].twitch_video)
                # print("    ", e.youtube_video)
                print("    ", tv.id, tv.title)
                if yv is None:
                    print("     Not Uploaded:", e.mkv_file)
                else:
                    print("    ", yv["id"], yv["snippet"]["description"], yv["snippet"]["title"])
                    
                # These are ones we need to set.
                if (yv is not None): # and (not yv["snippet"]["title"].startswith("[0") or not yv["snippet"]["description"].startswith("Original")):
                    TO_DO.append({ 
                            "current_title": yv["snippet"]["title"],
                            "youtube_id": yv["id"], 
                            "original_url": tv.obj["url"],
                            "title": tv.title.replace("#", ""),
                            "duration": tv.obj["duration"],
                            "created_at": tv.obj["created_at"]
                    })
                    
# This can be used in "set-youtube.py".
print(json.dumps(TO_DO, indent=2))


In [None]:
# Now, make sure that all Twitch IDs are represented somewhere on YouTube.

TO_DO : List[dict] = []  # Plain dicts rather than SetTitleRecord, so json.dumps() works below.

for olf in obs_logs:
    print(olf.filename, olf.hooked_procs)
    for e in olf.events:
        if e.event_type=='Stream':
            if e.twitch_video is not None:
                needs_mp4_vod = True

                print(f"  {e.twitch_video.id}, {e.twitch_video.obj['duration']}, {e.twitch_video.title}")

                # See if there's already a video uploaded from OBS.
                for rel in e.related:  # Not "re", which would shadow the regex module used in later cells.
                    if (rel.youtube_video is not None):
                        print(f'    MKV already uploaded: {rel.mkv_file} - {rel.youtube_video["id"]} - {rel.youtube_video["snippet"]["title"]}')
                        needs_mp4_vod = False
                    elif os.path.basename(rel.mkv_file) in MKV_BLACKLIST:
                        print(f'    Blacklisted MKV: {rel.mkv_file}')
                    else:
                        print(f'    Consider uploading MKV: {rel.mkv_file}')
                        needs_mp4_vod = False

                # See if the YouTube video is uploaded from one of the Twitch VODs.
                if (e.youtube_video is not None):
                    yv = e.youtube_video
                    tv = e.twitch_video
                    print(f'    {e.youtube_video["id"]} - {e.youtube_video["snippet"]["title"]}')
                    TO_DO.append({ 
                            "youtube_id": yv["id"], 
                            "original_url": tv.obj["url"],
                            "title": tv.title,
                            "duration": tv.obj["duration"],
                            "created_at": tv.obj["created_at"]
                    })
                    needs_mp4_vod = False
                elif needs_mp4_vod:
                    for z in local_twitch_mp4_files:
                        if os.path.basename(z).startswith(str(e.twitch_video.id)):
                            print(f'    Consider uploading this MP4: {z}')
            else:
                # There's a stream, but we don't have the JSON for it.
                print(f'  Unknown stream that lasted {e.end_time - e.start_time} seconds.')
            
                
            

print(json.dumps(TO_DO, indent=2))


In [None]:
# Build a table of YouTube videos uploaded directly from Twitch, with the VOD ID (if linked) and part number.
re_vodlink = re.compile(R'https://www.twitch.tv/videos/(\d+)')
re_part = re.compile(R'( *([\[(]Part #?(\d+)[\])]|\[#?0*(\d+)\]) *)')

for yv in youtube_videos:
    vod_id = None
    probable_vod_id = None
    part_num = None
    title = yv["snippet"]["title"]
    
    duration = util.parse_duration(yv["contentDetails"]["duration"]).total_seconds()
    
    if (m := re_part.search(title)):
        if (m.group(3) is not None):
            part_num = int(m.group(3))

        if (m.group(4) is not None):
            part_num = int(m.group(4))
            
        title = re_part.sub("", title)

    # Try to find the VOD ID with the description.
    if (m := re_vodlink.search(yv["snippet"]["description"])):
        vod_id = int(m.group(1))
    
    # See if there's a VOD with the same title.  (The length is not compared yet.)
    for v in twitch_videos:
        if v.title == yv["snippet"]["title"]:
            probable_vod_id = v.id
    
    if yv["fileDetails"]["fileName"] == "unknown":
        print(yv["id"], vod_id, duration, yv["fileDetails"]["fileName"], title, part_num, yv["snippet"]["title"], sep='\t')


In [None]:
for v in twitch_videos:
    print(v.id, v.duration.total_seconds(), v.title, sep='\t')