# Preppin' Data
## Week 6: 7-Letter Scrabble Words

https://preppindata.blogspot.com/2022/02/2022-week-6-7-letter-scrabble-words.html

### 1. Import pandas and file

In [22]:
import pandas as pd

In [194]:
file = '7 letter words.xlsx'
words = pd.read_excel(file, sheet_name='7 letter words')
scores = pd.read_excel(file, sheet_name='Scrabble Scores')

words

Unnamed: 0,7 letter word
0,ability
1,absence
2,academy
3,account
4,accused
...,...
968,quonked
969,quopped
970,rhizoma
971,rhizome


In [195]:
scores

Unnamed: 0,Scrabble
0,0 points: Blank ×2
1,"1 point: E ×12, A ×9, I ×9, O ×8, N ×6, R ×6, ..."
2,"2 points: D ×4, G ×3"
3,"3 points: B ×2, C ×2, M ×2, P ×2"
4,"4 points: F ×2, H ×2, V ×2, W ×2, Y ×2"
5,5 points: K ×1
6,"8 points: J ×1, X ×1"
7,"10 points: Q ×1, Z ×1"


### 2. Parse out the Scrabble Scores input so there are 3 fields (Tile, Frequency, Points)

1. Use str.split() with expand=True to separate into columns where there is a comma
2. Use str.split() with spaces to get the points for each set of letters
3. Remove the points information, then use pd.melt() to reshape the data to put all the tile information into a single column
4. Use str.split() to produce separate tile and frequency information
5. Use slicing to remove '×' from frequency information
6. Remove blank rows from dataframe
7. Convert 'Frequency' data to numeric
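As a compact cross-check of the steps listed above, the same three fields can also be produced with `explode()` instead of a wide split followed by a melt. This is only a sketch of an alternative route, not the method used in the cells below; the names `raw`, `parts`, `tile_parts` and `tidy` are made up for illustration, and the three input strings are copied from rows of the sheet that are displayed in full above.

```python
import pandas as pd

# Three rows copied from the 'Scrabble Scores' sheet shown above (illustrative subset)
raw = pd.DataFrame({'Scrabble': ['0 points: Blank ×2',
                                 '2 points: D ×4, G ×3',
                                 '8 points: J ×1, X ×1']})

# Split once on ':' -> points text on the left, tile list on the right
parts = raw['Scrabble'].str.split(':', n=1, expand=True)

tidy = pd.DataFrame({
    'Points': parts[0].str.split(' ', n=1).str[0].astype(int),
    'tile_info': parts[1].str.split(','),   # list of 'Tile ×N' strings per row
})

# One row per tile instead of one column per tile
tidy = tidy.explode('tile_info').reset_index(drop=True)

# 'D ×4' -> Tile 'D', Frequency 4 (slice off the leading '×')
tile_parts = tidy['tile_info'].str.strip().str.split(' ', n=1, expand=True)
tidy['Tile'] = tile_parts[0]
tidy['Frequency'] = tile_parts[1].str[1:].astype(int)
tidy = tidy[['Tile', 'Frequency', 'Points']]
print(tidy)
```

Because `explode()` never creates empty cells, this route needs no step for removing blank rows.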

In [196]:
#Split on comma with expand=True
scores = scores['Scrabble'].str.split(',', expand=True)
scores

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0 points: Blank ×2,,,,,,,,,
1,1 point: E ×12,A ×9,I ×9,O ×8,N ×6,R ×6,T ×6,L ×4,S ×4,U ×4
2,2 points: D ×4,G ×3,,,,,,,,
3,3 points: B ×2,C ×2,M ×2,P ×2,,,,,,
4,4 points: F ×2,H ×2,V ×2,W ×2,Y ×2,,,,,
5,5 points: K ×1,,,,,,,,,
6,8 points: J ×1,X ×1,,,,,,,,
7,10 points: Q ×1,Z ×1,,,,,,,,


In [201]:
#Split points information out before the first space
scores['Points'] = scores[0].str.split(' ', n=1).str[0]
scores['Points']

0     0
1     1
2     2
3     3
4     4
5     5
6     8
7    10
Name: Points, dtype: object

In [202]:
#Remove the remaining points information from the first column
scores[0] = scores[0].str.split(':').str[-1].str.strip()
scores

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,Points
0,Blank ×2,,,,,,,,,,0
1,E ×12,A ×9,I ×9,O ×8,N ×6,R ×6,T ×6,L ×4,S ×4,U ×4,1
2,D ×4,G ×3,,,,,,,,,2
3,B ×2,C ×2,M ×2,P ×2,,,,,,,3
4,F ×2,H ×2,V ×2,W ×2,Y ×2,,,,,,4
5,K ×1,,,,,,,,,,5
6,J ×1,X ×1,,,,,,,,,8
7,Q ×1,Z ×1,,,,,,,,,10


In [203]:
#Melt with id_vars = 'Points' to put all tile information into a single column
scores = scores.melt(id_vars='Points')
scores

Unnamed: 0,Points,variable,value
0,0,0,Blank ×2
1,1,0,E ×12
2,2,0,D ×4
3,3,0,B ×2
4,4,0,F ×2
...,...,...,...
75,3,9,
76,4,9,
77,5,9,
78,8,9,


In [204]:
#Drop the variable column as it has no meaning
scores = scores.drop('variable', axis=1)
scores

Unnamed: 0,Points,value
0,0,Blank ×2
1,1,E ×12
2,2,D ×4
3,3,B ×2
4,4,F ×2
...,...,...
75,3,
76,4,
77,5,
78,8,


In [205]:
#Use str.split() to split out the tile and frequency information into separate columns
scores['Tile'] = scores['value'].str.split().str[0]
scores['Frequency'] = scores['value'].str.split().str[1]
scores

Unnamed: 0,Points,value,Tile,Frequency
0,0,Blank ×2,Blank,×2
1,1,E ×12,E,×12
2,2,D ×4,D,×4
3,3,B ×2,B,×2
4,4,F ×2,F,×2
...,...,...,...,...
75,3,,,
76,4,,,
77,5,,,
78,8,,,


In [206]:
#Slice into the column to remove the leading '×' from Frequency
scores['Frequency'] = scores['Frequency'].str[1:]
scores

Unnamed: 0,Points,value,Tile,Frequency
0,0,Blank ×2,Blank,2
1,1,E ×12,E,12
2,2,D ×4,D,4
3,3,B ×2,B,2
4,4,F ×2,F,2
...,...,...,...,...
75,3,,,
76,4,,,
77,5,,,
78,8,,,


In [207]:
#Drop the value column; we've captured the meaning for it
scores = scores.drop('value', axis=1)
scores

Unnamed: 0,Points,Tile,Frequency
0,0,Blank,2
1,1,E,12
2,2,D,4
3,3,B,2
4,4,F,2
...,...,...,...
75,3,,
76,4,,
77,5,,
78,8,,


In [208]:
#Drop rows where there is no information
scores = scores[scores.Tile.notnull()].copy()
scores

Unnamed: 0,Points,Tile,Frequency
0,0,Blank,2
1,1,E,12
2,2,D,4
3,3,B,2
4,4,F,2
5,5,K,1
6,8,J,1
7,10,Q,1
9,1,A,9
10,2,G,3


In [209]:
#Convert 'Frequency' to numeric
scores['Frequency'] = pd.to_numeric(scores['Frequency'])


### 3. Calculate the percent chance of drawing a particular tile

1. Calculate the total number of tiles
2. Divide the frequency of tiles by the total number of tiles
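The two steps amount to a single division. The sketch below uses a hypothetical three-tile bag (`tiles`) so the arithmetic can be checked by eye; the real sheet sums to 100 tiles, as the output further down shows.

```python
import pandas as pd

# Hypothetical three-tile bag, just to keep the arithmetic easy to follow
tiles = pd.DataFrame({'Tile': ['A', 'B', 'C'], 'Frequency': [5, 3, 2]})

total_tiles = tiles['Frequency'].sum()                      # 5 + 3 + 2
tiles['Percent Chance'] = tiles['Frequency'] / total_tiles  # fraction between 0 and 1
print(tiles)
```

Note that the result is a fraction rather than a percentage: with the real data, E ×12 out of 100 tiles gives 0.12.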

In [210]:
#Take a sum of the frequency of the tiles and save as column
scores['Total Tiles'] = scores['Frequency'].sum()
scores

Unnamed: 0,Points,Tile,Frequency,Total Tiles
0,0,Blank,2,100
1,1,E,12,100
2,2,D,4,100
3,3,B,2,100
4,4,F,2,100
5,5,K,1,100
6,8,J,1,100
7,10,Q,1,100
9,1,A,9,100
10,2,G,3,100


In [211]:
#Calculate percent chance
scores['Percent Chance'] = scores['Frequency']/scores['Total Tiles']
scores

Unnamed: 0,Points,Tile,Frequency,Total Tiles,Percent Chance
0,0,Blank,2,100,0.02
1,1,E,12,100,0.12
2,2,D,4,100,0.04
3,3,B,2,100,0.02
4,4,F,2,100,0.02
5,5,K,1,100,0.01
6,8,J,1,100,0.01
7,10,Q,1,100,0.01
9,1,A,9,100,0.09
10,2,G,3,100,0.03


### 4. Count the frequency of each letter in each word in the word list

1. Split the words into letters with one column per letter
2. Melt the resulting separate letter columns into a single column with all of the letters
3. Use a groupby with a .count() aggregation to count the frequency of each letter in each word
4. Merge the original word list with the frequency information
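For comparison, the four steps can be collapsed into an `explode()` followed by a grouped `transform()`, which attaches the count to every letter row without a separate merge. This is a sketch of an alternative, assuming a hypothetical two-word frame `words_demo`; the cells below use the split, melt, groupby and merge route instead.

```python
import pandas as pd

words_demo = pd.DataFrame({'word': ['academy', 'zyzzyva']})

# One row per letter: turn each word into a list of characters, then explode
letters = words_demo.assign(letter=words_demo['word'].apply(list)).explode('letter')
letters = letters.reset_index(drop=True)

# Count how often each letter occurs within its own word, aligned to every row
letters['Count of Letter in Word'] = (
    letters.groupby(['word', 'letter'])['letter'].transform('count')
)
print(letters)
```

Both rows for the letter `a` in `academy` carry a count of 2, which matches the merged output shown later in this section.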

In [213]:
#Split the words into letters
for i in range(7):
    words[f'letter {i + 1}'] = words['7 letter word'].str[i]
words

Unnamed: 0,7 letter word,letter 1,letter 2,letter 3,letter 4,letter 5,letter 6,letter 7
0,ability,a,b,i,l,i,t,y
1,absence,a,b,s,e,n,c,e
2,academy,a,c,a,d,e,m,y
3,account,a,c,c,o,u,n,t
4,accused,a,c,c,u,s,e,d
...,...,...,...,...,...,...,...,...
968,quonked,q,u,o,n,k,e,d
969,quopped,q,u,o,p,p,e,d
970,rhizoma,r,h,i,z,o,m,a
971,rhizome,r,h,i,z,o,m,e


In [214]:
#Melt the resulting letter columns back to a single column
words = words.melt(id_vars='7 letter word', var_name='position', value_name='letter')
words

Unnamed: 0,7 letter word,position,letter
0,ability,letter 1,a
1,absence,letter 1,a
2,academy,letter 1,a
3,account,letter 1,a
4,accused,letter 1,a
...,...,...,...
6806,quonked,letter 7,d
6807,quopped,letter 7,d
6808,rhizoma,letter 7,a
6809,rhizome,letter 7,e


In [215]:
#Groupby the individual letter to count the number of letters in each word
gb = words.groupby(['7 letter word','letter']).count().reset_index()
gb = gb.rename(columns={'7 letter word' : 'word', 'position' : 'Count of Letter in Word'})
gb

Unnamed: 0,word,letter,Count of Letter in Word
0,Reading,R,1
1,Reading,a,1
2,Reading,d,1
3,Reading,e,1
4,Reading,g,1
...,...,...,...
5897,zythums,z,1
5898,zyzzyva,a,1
5899,zyzzyva,v,1
5900,zyzzyva,y,2


In [216]:
#Merge the original information with the count information from the groupby
words_2 = pd.merge(words, gb, left_on=['7 letter word', 'letter'], right_on=['word', 'letter'], how='left')
words_2

Unnamed: 0,7 letter word,position,letter,word,Count of Letter in Word
0,ability,letter 1,a,ability,1
1,absence,letter 1,a,absence,1
2,academy,letter 1,a,academy,2
3,account,letter 1,a,account,1
4,accused,letter 1,a,accused,1
...,...,...,...,...,...
6806,quonked,letter 7,d,quonked,1
6807,quopped,letter 7,d,quopped,1
6808,rhizoma,letter 7,a,rhizoma,1
6809,rhizome,letter 7,e,rhizome,1


In [217]:
#Perform tests to verify you haven't accidentally dropped information

words.nunique()

7 letter word    973
position           7
letter            27
dtype: int64

In [218]:
words_2.nunique()

7 letter word              973
position                     7
letter                      27
word                       973
Count of Letter in Word      4
dtype: int64
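The `nunique()` comparison above is an eyeball check. One way to make the same idea fail loudly is to assert the invariants directly. The sketch below does this on a hypothetical one-word frame (`long_demo`, `counts_demo` and `merged_demo` are illustrative names, not objects from the notebook).

```python
import pandas as pd

# Hypothetical miniature versions of the melted table and the groupby result
long_demo = pd.DataFrame({'word': ['aba', 'aba', 'aba'],
                          'letter': ['a', 'b', 'a']})
counts_demo = (long_demo.groupby(['word', 'letter']).size()
               .reset_index(name='Count of Letter in Word'))

merged_demo = pd.merge(long_demo, counts_demo, on=['word', 'letter'], how='left')

# A left merge against unique keys must keep exactly the rows we started with
assert len(merged_demo) == len(long_demo)
# Every letter row should have found its count
assert merged_demo['Count of Letter in Word'].notna().all()
print(merged_demo)
```

The 27 distinct letters reported above are consistent with the capital R of 'Reading' being counted separately from lowercase r, which the upper-casing in the next section evens out.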

In [219]:
#Remove extra columns
words_2 = words_2[['word', 'letter', 'position', 'Count of Letter in Word']].copy()
words_2

Unnamed: 0,word,letter,position,Count of Letter in Word
0,ability,a,letter 1,1
1,absence,a,letter 1,1
2,academy,a,letter 1,2
3,account,a,letter 1,1
4,accused,a,letter 1,1
...,...,...,...,...
6806,quonked,d,letter 7,1
6807,quopped,d,letter 7,1
6808,rhizoma,a,letter 7,1
6809,rhizome,e,letter 7,1


### 5. Join each letter to its scrabble tile

1. Capitalize letters from the words table to facilitate join
2. Join both tables using pd.merge()
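A left join silently keeps letters that found no tile, so it can help to ask `pd.merge()` to report them. The sketch below uses the `validate` and `indicator` parameters on two small hypothetical frames (`letters_demo`, `tiles_demo`); the Z tile is left out on purpose to show what an unmatched letter looks like.

```python
import pandas as pd

letters_demo = pd.DataFrame({'word': ['quiz'] * 4, 'letter': ['q', 'u', 'i', 'z']})
# 'Z' is deliberately missing from this illustrative tile table
tiles_demo = pd.DataFrame({'Tile': ['Q', 'U', 'I'], 'Points': [10, 1, 1]})

# Upper-case the letters so they match the tile names
letters_demo['letter'] = letters_demo['letter'].str.upper()

joined_demo = pd.merge(
    letters_demo, tiles_demo,
    how='left', left_on='letter', right_on='Tile',
    validate='many_to_one',   # raises MergeError if a tile appears twice in tiles_demo
    indicator=True,           # adds a '_merge' column: 'both' or 'left_only'
)
unmatched = joined_demo.loc[joined_demo['_merge'] == 'left_only', 'letter'].tolist()
print(unmatched)
```

With the real tables the list of unmatched letters would be expected to come back empty once the letters are capitalized.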

In [220]:
#Capitalize letters to prepare for join with letter/frequency input
words_2['letter'] = words_2['letter'].str.upper().str.strip()
words_2

Unnamed: 0,word,letter,position,Count of Letter in Word
0,ability,A,letter 1,1
1,absence,A,letter 1,1
2,academy,A,letter 1,2
3,account,A,letter 1,1
4,accused,A,letter 1,1
...,...,...,...,...
6806,quonked,D,letter 7,1
6807,quopped,D,letter 7,1
6808,rhizoma,A,letter 7,1
6809,rhizome,E,letter 7,1


In [221]:
joined = pd.merge(words_2, scores, how='left', left_on='letter', right_on='Tile')
joined

Unnamed: 0,word,letter,position,Count of Letter in Word,Points,Tile,Frequency,Total Tiles,Percent Chance
0,ability,A,letter 1,1,1,A,9,100,0.09
1,absence,A,letter 1,1,1,A,9,100,0.09
2,academy,A,letter 1,2,1,A,9,100,0.09
3,account,A,letter 1,1,1,A,9,100,0.09
4,accused,A,letter 1,1,1,A,9,100,0.09
...,...,...,...,...,...,...,...,...,...
6806,quonked,D,letter 7,1,2,D,4,100,0.04
6807,quopped,D,letter 7,1,2,D,4,100,0.04
6808,rhizoma,A,letter 7,1,1,A,9,100,0.09
6809,rhizome,E,letter 7,1,1,E,12,100,0.12


### 6. Update the % chance of drawing a letter based on the occurrences in that word

1. Remove words that are impossible because the tile set does not contain enough copies of a letter
2. Calculate probabilities for the remaining words
    - Calculate each letter's probability by raising the chance of drawing that tile to the power of the number of times the letter occurs in the word
    - Calculate the overall word probability by multiplying the letter probabilities together as an aggregate function of a groupby
        - Also, sum up total points in each word in the same groupby
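As a worked example of the calculation described above, the sketch below rebuilds the seven letter rows for 'ability' by hand, using the tile chances and points from the tables earlier in the notebook (A ×9, B ×2, I ×9, L ×4, T ×6 and Y ×2 out of 100 tiles). The frame name `ability` and the hand-typed values are for illustration only.

```python
import pandas as pd

# Letter rows for 'ability', with tile chances and points taken from the tables above
ability = pd.DataFrame({
    'word': ['ability'] * 7,
    'letter': list('ABILITY'),
    'Count of Letter in Word': [1, 1, 2, 1, 2, 1, 1],
    'Percent Chance': [0.09, 0.02, 0.09, 0.04, 0.09, 0.06, 0.02],
    'Points': [1, 3, 1, 1, 1, 1, 4],
})

# Per-row probability: chance raised to the letter's count in the word
ability['Updated Probability'] = (
    ability['Percent Chance'] ** ability['Count of Letter in Word']
)

# One row per word: total points and the product of the per-row probabilities
summary = ability.groupby('word').agg({'Points': 'sum', 'Updated Probability': 'prod'})
print(summary)
```

This gives 12 points and a probability of about 5.668704e-12, the same figures the final groupby reports for 'ability'. Because each of the two I rows carries 0.09 squared, the product contains 0.09 to the fourth power for that letter; that is a property of this row-wise method rather than an exact draw-without-replacement probability.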

In [222]:
#First, drop the letter rows where a word needs more copies of a tile than the tile set contains
joined = joined[joined['Count of Letter in Word'] <= joined['Frequency']]
joined

Unnamed: 0,word,letter,position,Count of Letter in Word,Points,Tile,Frequency,Total Tiles,Percent Chance
0,ability,A,letter 1,1,1,A,9,100,0.09
1,absence,A,letter 1,1,1,A,9,100,0.09
2,academy,A,letter 1,2,1,A,9,100,0.09
3,account,A,letter 1,1,1,A,9,100,0.09
4,accused,A,letter 1,1,1,A,9,100,0.09
...,...,...,...,...,...,...,...,...,...
6806,quonked,D,letter 7,1,2,D,4,100,0.04
6807,quopped,D,letter 7,1,2,D,4,100,0.04
6808,rhizoma,A,letter 7,1,1,A,9,100,0.09
6809,rhizome,E,letter 7,1,1,E,12,100,0.12


In [223]:
#Create a 2nd dataframe to hold just the words (unique) with how many valid letters they have

counts = joined['word'].value_counts().rename_axis('word').reset_index(name='value_counts')
counts

Unnamed: 0,word,value_counts
0,jointly,7
1,company,7
2,setting,7
3,perhaps,7
4,organic,7
...,...,...
968,pizazzy,4
969,maximum,4
970,zizzing,4
971,zizzled,4


In [224]:
#Filter the second dataframe to keep only the words with seven viable letters

counts = counts[counts['value_counts'] == 7]
counts

Unnamed: 0,word,value_counts
0,jointly,7
1,company,7
2,setting,7
3,perhaps,7
4,organic,7
...,...,...
802,foxlike,7
803,drawing,7
804,library,7
805,command,7


In [225]:
#Merge the joined data with the viable-words dataframe, using the viable words as a filter

joined = pd.merge(counts, joined, how='left', on='word')
joined

Unnamed: 0,word,value_counts,letter,position,Count of Letter in Word,Points,Tile,Frequency,Total Tiles,Percent Chance
0,jointly,7,J,letter 1,1,8,J,1,100,0.01
1,jointly,7,O,letter 2,1,1,O,8,100,0.08
2,jointly,7,I,letter 3,1,1,I,9,100,0.09
3,jointly,7,N,letter 4,1,1,N,6,100,0.06
4,jointly,7,T,letter 5,1,1,T,6,100,0.06
...,...,...,...,...,...,...,...,...,...,...
5644,quicker,7,I,letter 3,1,1,I,9,100,0.09
5645,quicker,7,C,letter 4,1,3,C,2,100,0.02
5646,quicker,7,K,letter 5,1,5,K,1,100,0.01
5647,quicker,7,E,letter 6,1,1,E,12,100,0.12
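The filter above is built from three cells: drop the failing letter rows, count the surviving rows per word, then merge back the words that still have seven. A possible one-step variant is a grouped `transform('all')`, sketched below on hypothetical rows (`rows`, `enough_tiles` and `viable` are illustrative names). It does not depend on the word length being exactly seven.

```python
import pandas as pd

# Hypothetical letter rows: 'pizza' needs two Z tiles, but only one exists
rows = pd.DataFrame({
    'word': ['cab'] * 3 + ['pizza'] * 5,
    'letter': ['C', 'A', 'B', 'P', 'I', 'Z', 'Z', 'A'],
    'Count of Letter in Word': [1, 1, 1, 1, 1, 2, 2, 1],
    'Frequency': [2, 9, 2, 2, 9, 1, 1, 9],
})

# True for a row when the tile set holds enough copies of that letter
rows['enough_tiles'] = rows['Count of Letter in Word'] <= rows['Frequency']

# Keep a word only if every one of its letter rows passes the check
viable = rows[rows.groupby('word')['enough_tiles'].transform('all')]
print(viable['word'].unique())
```

Here all five rows of 'pizza' are dropped together, including the letters that were individually fine.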


In [226]:
#Remove extra columns

joined = joined[['word', 'letter', 'position', 'Count of Letter in Word', 'Points', 'Frequency', 'Percent Chance']].copy()
joined.head()

Unnamed: 0,word,letter,position,Count of Letter in Word,Points,Frequency,Percent Chance
0,jointly,J,letter 1,1,8,1,0.01
1,jointly,O,letter 2,1,1,8,0.08
2,jointly,I,letter 3,1,1,9,0.09
3,jointly,N,letter 4,1,1,6,0.06
4,jointly,T,letter 5,1,1,6,0.06


In [227]:
joined['Count of Letter in Word'].value_counts()

1    4384
2    1136
3     129
Name: Count of Letter in Word, dtype: int64

In [228]:
# Define a probability function to use with .apply to determine the probability of a particular tile being pulled

def probability(count, chance):
    # Raise the single-draw chance to the number of times the letter occurs in the word
    return chance ** count

In [229]:
#Apply the probability function and save as a new column

joined['Updated Probability'] = joined.apply(lambda x: probability(x['Count of Letter in Word'], x['Percent Chance']), axis=1)
joined['Updated Probability']

0       0.01
1       0.08
2       0.09
3       0.06
4       0.06
        ... 
5644    0.09
5645    0.02
5646    0.01
5647    0.12
5648    0.06
Name: Updated Probability, Length: 5649, dtype: float64
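The row-wise `.apply()` above can also be expressed as a single column operation, since pandas applies `**` element by element. The sketch below shows this on a hypothetical three-row frame (`demo`) with a chance of 0.5, chosen so the powers are easy to verify.

```python
import pandas as pd

demo = pd.DataFrame({
    'Count of Letter in Word': [1, 2, 3],
    'Percent Chance': [0.5, 0.5, 0.5],
})

# Element-wise power: each row's chance raised to that row's count
demo['Updated Probability'] = demo['Percent Chance'] ** demo['Count of Letter in Word']
print(demo)
```

On a table of a few thousand rows either form is fast enough; the column form mainly avoids having to enumerate the possible counts.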

In [230]:
#Convert points column to numeric

joined['Points'] = pd.to_numeric(joined['Points'])
joined

Unnamed: 0,word,letter,position,Count of Letter in Word,Points,Frequency,Percent Chance,Updated Probability
0,jointly,J,letter 1,1,8,1,0.01,0.01
1,jointly,O,letter 2,1,1,8,0.08,0.08
2,jointly,I,letter 3,1,1,9,0.09,0.09
3,jointly,N,letter 4,1,1,6,0.06,0.06
4,jointly,T,letter 5,1,1,6,0.06,0.06
...,...,...,...,...,...,...,...,...
5644,quicker,I,letter 3,1,1,9,0.09,0.09
5645,quicker,C,letter 4,1,3,2,0.02,0.02
5646,quicker,K,letter 5,1,5,1,0.01,0.01
5647,quicker,E,letter 6,1,1,12,0.12,0.12


In [231]:
# Create a groupby to sum the points of a word's letters and take the product of the updated probabilities

gb3 = joined.groupby('word').agg({'Points' : 'sum', 'Updated Probability' : 'prod'}).reset_index()
gb3

Unnamed: 0,word,Points,Updated Probability
0,Reading,9,4.199040e-09
1,ability,12,5.668704e-12
2,absence,11,1.791590e-11
3,academy,15,2.519424e-12
4,account,11,1.658880e-13
...,...,...,...
802,zymogen,22,6.912000e-11
803,zymomes,23,1.228800e-14
804,zymotic,23,3.456000e-11
805,zymurgy,25,2.304000e-15


In [233]:
#Transform the word back to lower case

gb3['word'] = gb3['word'].str.lower()
gb3

Unnamed: 0,word,Points,Updated Probability
0,reading,9,4.199040e-09
1,ability,12,5.668704e-12
2,absence,11,1.791590e-11
3,academy,15,2.519424e-12
4,account,11,1.658880e-13
...,...,...,...
802,zymogen,22,6.912000e-11
803,zymomes,23,1.228800e-14
804,zymotic,23,3.456000e-11
805,zymurgy,25,2.304000e-15


### 7. Rank the words by probability of choosing them and by the number of points they'll earn

1. Rank the words by probability using .rank()
2. Rank the words by points using .rank()
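The cells below pass `method='dense'` to `.rank()`. The sketch here contrasts that with the default `method='average'` on a hypothetical four-word frame (`demo`, with placeholder words `w1` to `w4`), to show why dense ranking gives tied words the same whole-number rank with no gaps afterwards.

```python
import pandas as pd

demo = pd.DataFrame({'word': ['w1', 'w2', 'w3', 'w4'], 'Points': [29, 27, 27, 7]})

# Dense: ties share a rank and the next rank follows immediately
demo['Dense Rank'] = demo['Points'].rank(ascending=False, method='dense')
# Default ('average'): ties get the mean of the positions they occupy
demo['Average Rank'] = demo['Points'].rank(ascending=False)
print(demo)
```

`.rank()` returns floats, which is why the ranks in the output below appear as `1.0`, `2.0` and so on; appending `.astype(int)` is one way to show them as integers.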

In [234]:
#Rank the updated probabilities using .rank()

gb3['Likelihood Rank'] = gb3['Updated Probability'].rank(ascending=False, method='dense')
gb3

Unnamed: 0,word,Points,Updated Probability,Likelihood Rank
0,reading,9,4.199040e-09,3.0
1,ability,12,5.668704e-12,164.0
2,absence,11,1.791590e-11,117.0
3,academy,15,2.519424e-12,187.0
4,account,11,1.658880e-13,251.0
...,...,...,...,...
802,zymogen,22,6.912000e-11,79.0
803,zymomes,23,1.228800e-14,321.0
804,zymotic,23,3.456000e-11,98.0
805,zymurgy,25,2.304000e-15,360.0


In [236]:
#Rank the most valuable words by using .rank()

gb3['Points Rank'] = gb3['Points'].rank(ascending=False, method='dense')
gb3.sort_values('Points Rank')

Unnamed: 0,word,Points,Updated Probability,Likelihood Rank,Points Rank
456,muzjiks,29,2.880000e-12,183.0,1.0
719,tzaddiq,27,1.244160e-13,261.0,2.0
97,cazique,27,7.776000e-11,75.0,2.0
374,jukebox,27,7.680000e-12,151.0,2.0
442,mezquit,27,5.184000e-11,88.0,2.0
...,...,...,...,...,...
407,leisure,7,7.166362e-11,77.0,22.0
460,natural,7,2.267482e-11,115.0,22.0
722,unusual,7,2.264924e-18,410.0,22.0
328,instant,7,5.441956e-14,285.0,22.0


### 8. Export to csv

In [237]:
#Export Results to csv

gb3.to_csv('pandas_solution.csv', index=False)