In [1]:
import pandas as pd
import os


DATA_ROOT = os.path.join(os.getcwd(), 'data')

df_colors = pd.read_csv(os.path.join(
    DATA_ROOT, 'The Joy Of Painiting - Colors Used'))

df_subjects = pd.read_csv(os.path.join(
    DATA_ROOT, 'The Joy Of Painiting - Subject Matter'))

df_dates = pd.read_csv(os.path.join(
    DATA_ROOT, 'The Joy Of Painting - Episode Dates'), delimiter="\t", header=None, names=['src'])


## Colors Used

- First column is an unlabeled autoincrement index
- 403 unique rows identified by `season` + `episode`-- drop the first column and use this pair as a natural key
- 401 unique `painting_title` values
- 403 unique `painting_index` values
- Cardinality between `painting_index` and `season` + `episode` is 1:1-- no need to create a separate "paintings" dimension unless the client is anticipating some awful "The Joy of Painting" reboot that features either multiple paintings per episode, or paintings that take multiple episodes to complete.
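
A minimal sketch of the key check described in the bullets above, run on a tiny hand-made frame instead of the real CSV. The helper name `is_one_to_one` and the toy rows are assumptions for illustration only; the column names mirror the ones in the Colors Used file.

```python
import pandas as pd


def is_one_to_one(df, left_cols, right_cols):
    """Return True if each left key maps to exactly one right key and vice versa."""
    pairs = df[left_cols + right_cols].drop_duplicates()
    left_unique = not pairs.duplicated(subset=left_cols).any()
    right_unique = not pairs.duplicated(subset=right_cols).any()
    return left_unique and right_unique


# Toy stand-in for df_colors (hypothetical rows, same column names as the CSV).
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "painting_index": [282, 283, 284],
    "season": [1, 1, 1],
    "episode": [1, 2, 3],
})

# Drop the unlabeled autoincrement column and key the frame on season + episode.
# verify_integrity=True raises if the key turns out not to be unique.
keyed = toy.drop(columns=["Unnamed: 0"]).set_index(
    ["season", "episode"], verify_integrity=True)

one_to_one = is_one_to_one(toy, ["season", "episode"], ["painting_index"])
print(one_to_one)
```

Comparing unique counts alone (403 vs. 403) does not strictly prove a 1:1 mapping, which is why the sketch checks the deduplicated pairs in both directions.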


In [8]:
print("The Joy Of Painiting - Colors Used")
print(f"- Initial rows: {len(df_colors.index)}")
print(f"- Total valid: {len(df_colors.dropna().index)}")
print(
    f"- Unique episodes: {len(df_colors[['season', 'episode']].drop_duplicates().index)}")
print(
    f"- Unique painting_title: {len(df_colors[['painting_title']].drop_duplicates().index)}")
print(
    f"- Unique painting_index: {len(df_colors[['painting_index']].drop_duplicates().index)}")

df_colors.head(3)


The Joy Of Painiting - Colors Used
- Initial rows: 403
- Total valid: 403
- Unique episodes: 403
- Unique painting_title: 401
- Unique painting_index: 403


Unnamed: 0.1,Unnamed: 0,painting_index,img_src,painting_title,season,episode,num_colors,youtube_src,colors,color_hex,...,Liquid_Clear,Midnight_Black,Phthalo_Blue,Phthalo_Green,Prussian_Blue,Sap_Green,Titanium_White,Van_Dyke_Brown,Yellow_Ochre,Alizarin_Crimson
0,1,282,https://www.twoinchbrush.com/images/painting28...,A Walk in the Woods,1,1,8,https://www.youtube.com/embed/oh5p5f5_-7A,"['Alizarin Crimson', 'Bright Red', 'Cadmium Ye...","['#4E1500', '#DB0000', '#FFEC00', '#102E3C', '...",...,0,0,0,1,1,1,1,1,0,1
1,2,283,https://www.twoinchbrush.com/images/painting28...,Mt. McKinley,1,2,8,https://www.youtube.com/embed/RInDWhYceLU,"['Alizarin Crimson', 'Bright Red', 'Cadmium Ye...","['#4E1500', '#DB0000', '#FFEC00', '#102E3C', '...",...,0,0,0,1,1,1,1,1,0,1
2,3,284,https://www.twoinchbrush.com/images/painting28...,Ebony Sunset,1,3,9,https://www.youtube.com/embed/UOziR7PoVco,"['Alizarin Crimson', 'Black Gesso', 'Bright Re...","['#4E1500', '#000000', '#DB0000', '#FFEC00', '...",...,0,0,0,1,1,1,1,1,0,1


## Subject Matter
- Subject matter also has 403 unique IDs and 401 unique titles, so assume 403 total episodes with at least one repeated title.
- `EPISODE` is always 6 characters, formatted `S##E##`
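
Because `EPISODE` follows a fixed `S##E##` pattern, it can in principle be split into numeric `season` and `episode` columns that match the natural key in Colors Used. The sketch below is one possible way to do that, not something the original analysis ran; the toy rows are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_subjects["EPISODE"] (hypothetical rows).
toy = pd.DataFrame({"EPISODE": ["S01E01", "S01E02", "S31E13"]})

# Extract the two numeric groups; rows that don't match the pattern become NaN.
parts = toy["EPISODE"].str.extract(r"^S(?P<season>\d{2})E(?P<episode>\d{2})$")

# Fail loudly if any code breaks the S##E## format before casting to int.
if parts.isna().any().any():
    raise ValueError("unexpected EPISODE format")

# Attach the parsed columns; the index is shared, so join aligns row by row.
toy = toy.join(parts.astype(int))

print(toy)
```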

In [7]:
print("The Joy Of Painiting - Subject Matter")
print(f"- Initial rows: {len(df_subjects.index)}")
print(f"- Total valid: {len(df_subjects.dropna().index)}")
print(
    f"- Unique episodes: {len(df_subjects[['EPISODE']].drop_duplicates().index)}")
print(
    f"- Unique titles: {len(df_subjects[['TITLE']].drop_duplicates().index)}")
# length of each EPISODE code, computed with the vectorized string accessor
df_subjects['len_EPISODE'] = df_subjects['EPISODE'].str.len()
print(
    f"- Min/max len of EPISODE: {min(df_subjects['len_EPISODE']), max(df_subjects['len_EPISODE'])}")
df_subjects.head(3)


The Joy Of Painiting - Subject Matter
- Initial rows: 403
- Total valid: 403
- Unique episodes: 403
- Unique titles: 401
- Min/max len of EPISODE: (6, 6)


Unnamed: 0,EPISODE,TITLE,APPLE_FRAME,AURORA_BOREALIS,BARN,BEACH,BOAT,BRIDGE,BUILDING,BUSHES,...,TREE,TREES,TRIPLE_FRAME,WATERFALL,WAVES,WINDMILL,WINDOW_FRAME,WINTER,WOOD_FRAMED,len_EPISODE
0,S01E01,"""A WALK IN THE WOODS""",0,0,0,0,0,0,0,1,...,1,1,0,0,0,0,0,0,0,6
1,S01E02,"""MT. MCKINLEY""",0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,1,0,6
2,S01E03,"""EBONY SUNSET""",0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,1,0,6


## Episode Dates

- Imported as a single column containing, at minimum, an episode title and a date (formatted `(Month D, YYYY)`)
- Some rows have episode descriptions following the date-- can split on `(Month D, YYYY)` as a delimiter to find them
- No good identifier data to tie to the other two datasets
- Since row counts align across all datasets, assume the join can be made between dates and episode IDs ordered in the same direction

On the last point, I went down a nasty rabbit hole trying to join all three datasets using titles, foolishly assuming that they would line up. After a lengthy process of fixing the small percentage of misaligned titles, I realized that, as long as the row counts matched between Episode Dates and the other two datasets, it would have been FAR more reliable to start out with a join on row order. Note to self and anyone reading: **do not go down rabbit holes trying to clean free text data when you have literally any other option**. One dataset's `SHADES OF GREY` is another dataset's `SHADES OF GRAY`.
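
A rough sketch of the row-order join described above, assuming both files list episodes in the same order. The frames are toy stand-ins with placeholder values (the date is deliberately fake), and the row-count guard is my own addition rather than part of the original notebook.

```python
import pandas as pd

# Toy stand-ins (placeholder rows); the titles deliberately disagree on spelling.
subjects = pd.DataFrame({
    "EPISODE": ["S01E01", "S01E02"],
    "TITLE": ['"SHADES OF GREY"', '"PLACEHOLDER TITLE"'],
})
dates = pd.DataFrame({
    "src": ["Shades of Gray (January 1, 2000)",
            "Placeholder Title (January 8, 2000)"],
})

# A positional join is only defensible when the row counts match exactly.
if len(subjects) != len(dates):
    raise ValueError("row counts differ; positional join is not safe")

# Align on row position by resetting both indexes, then concatenate columns.
joined = pd.concat(
    [subjects.reset_index(drop=True), dates.reset_index(drop=True)], axis=1)

print(joined)
```

Matching row counts are necessary but not sufficient, so it is still worth spot-checking a handful of joined rows by eye.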



In [6]:
print("The Joy Of Painiting - Episode Dates")
print(f"- Initial rows: {len(df_dates.index)}")
print(f"- Total valid: {len(df_dates.dropna().index)}")
# add another column with the length of the imported string
df_dates['src_len'] = df_dates['src'].str.len()
print(f"- Max src_len: {max(df_dates['src_len'])}")
print("Longest 5 src:")
for ix, row in df_dates.sort_values('src_len', ascending=False).head(5).iterrows():
    print(f"   - {row['src']}")
print("Shortest 5 src:")
for ix, row in df_dates.sort_values('src_len').head(5).iterrows():
    print(f"   - {row['src']}")

The Joy Of Painiting - Episode Dates
- Initial rows: 403
- Total valid: 403
- Max src_len: 183
Longest 5 src:
   - Fisherman's Trail (May 25, 1993) - After painting the canvas to resemble wood, Ross paints a landscape with the titular trail, plus mountains, trees, water, and shrubbery, but no sky.
   - Waterfall Wonder (October 26, 1988) Footage with Grand Ole Opry regular Hank Snow and announcer Grant Turner
   - Contemplative Lady (September 21, 1988) Special guest John Thamm (Bob Ross's former instructor)
   - Mountain Lake Falls (September 28, 1993) Special guest Steve Ross (Bob's son)
   - That Time of Year (October 19, 1988) Special guest Steve Ross (Bob's son)
Shortest 5 src:
   - Surf's Up (May 7, 1986)
   - Cliffside (May 9, 1990)
   - Blue River (May 1, 1985)
   - Campfire (March 8, 1984)
   - Seascape (March 1, 1983)
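
The longest and shortest rows above suggest that every line has the shape `Title (Month D, YYYY)` followed by an optional note. Here is a minimal sketch of splitting on that date pattern using only the standard library. The names `DATE_RE` and `split_src` are hypothetical, and the sketch assumes English month names and that the first parenthesized date on a line is the air date.

```python
import re
from datetime import datetime

# Matches "(Month D, YYYY)"; the title is everything before it, notes after it.
DATE_RE = re.compile(r"\((?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})\)")


def split_src(src):
    """Split one raw line into (title, air_date, note); note may be empty."""
    match = DATE_RE.search(src)
    if match is None:
        raise ValueError(f"no date found in: {src!r}")
    title = src[:match.start()].strip()
    air_date = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    # Notes sometimes start with " - "; strip that separator if present.
    note = src[match.end():].strip().lstrip("-").strip()
    return title, air_date, note


print(split_src("Seascape (March 1, 1983)"))
print(split_src(
    "Mountain Lake Falls (September 28, 1993) "
    "Special guest Steve Ross (Bob's son)"))
```

Parentheses inside the note, such as `(Bob's son)`, are left alone because they do not match the date pattern.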
