In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [2]:
df = pd.read_csv('cleaned.csv')
df

Unnamed: 0.1,Unnamed: 0,name,hp,atk,def,spa,spd,spe,BST,weightkg,type1,type2,generation,tier,tiernum
0,0,Venusaur,80,82,83,100,100,80,525,100.0,Grass,Poison,1,PU,3
1,1,Charizard,78,84,78,109,85,100,534,90.5,Fire,Flying,1,ZU,1
2,2,Blastoise,79,83,100,85,105,78,530,85.5,Water,Water,1,RUBL,8
3,3,Arbok,60,95,69,65,79,80,448,65.0,Poison,Poison,1,ZU,1
4,4,Pikachu,35,55,40,50,50,90,320,6.0,Electric,Electric,1,ZU,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
485,485,Iron Boulder,90,120,80,68,108,124,590,162.5,Rock,Psychic,9,UUBL,10
486,486,Iron Crown,90,72,100,122,108,98,590,156.0,Steel,Psychic,9,OU,11
487,487,Terapagos,90,65,85,65,85,60,450,6.5,Normal,Normal,9,Uber,12
488,488,Terapagos-Stellar,160,105,110,130,110,85,700,77.0,Normal,Normal,9,Uber,12


In [3]:
# tiers
tiers = ['AG', 'Uber', 'OU', 'UUBL', 'UU', 'RUBL', 'RU', 'NUBL', 'NU', 'PUBL', 'PU', 'ZUBL','ZU']
tiers = tiers[::-1] # yes i copied this from the previous notebook

px.histogram(df,x='tier',title='Smogon Tier Distribution',category_orders={'tier':tiers})

There isn't much of a distribution here beyond the fact that the vast majority of Pokemon are left to waste away in the ZU tier. A good optimization for the future would be to adjust the "tier numbers" to aim for a more uniform distribution, or to incorporate usage rates within each tier in order to better separate the Pokemon clumped together in ZU.
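As a rough illustration of the first idea, here is a minimal sketch that replaces the raw tier number with its percentile rank. It runs on the nine rows visible in the preview above as a stand-in for `df`, and the `tier_pct` column name is my own invention. Tied Pokemon share a rank, so everything in ZU would still collapse to a single value; separating them would need something extra, like usage rates.

```python
import pandas as pd

# Stand-in for df: the nine rows visible in the preview above.
sample = pd.DataFrame({
    "name": ["Venusaur", "Charizard", "Blastoise", "Arbok", "Pikachu",
             "Iron Boulder", "Iron Crown", "Terapagos", "Terapagos-Stellar"],
    "tiernum": [3, 1, 8, 1, 1, 10, 11, 12, 12],
})

# Percentile rank of each tier number (hypothetical column name).
# Ties share the average rank, so all ZU Pokemon get the same value.
sample["tier_pct"] = sample["tiernum"].rank(pct=True)

print(sample.sort_values("tier_pct"))
```

On the full dataset the same call would spread the sparsely populated upper tiers more evenly across the 0 to 1 range.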

In [4]:
# stats
stats = ['hp','atk','def','spa','spd','spe','BST']
for i in stats:
  fig = px.scatter(df, x=i, y='tiernum', hover_name='name', trendline='lowess',title = f'{i} vs Tier Number',
                      labels={
                                "tiernum": "Tier Number",
                            }
  )
  fig.show()

BST seems to be at least decently related to a Pokemon's viability. At the very least, a Pokemon with an extremely low or extremely high stat total can be easily predicted to be in ZU or UU-Ubers, respectively. Unfortunately, the relationship is weaker in the middle, as Pokemon with a BST around 500 are spread out across every tier.

Individual stats do not appear to have a strong correlation with a Pokemon's viability. To a certain extent, this makes sense; save for the best Pokemon with extremely high totals, having a single extremely high stat inevitably requires sacrifices elsewhere, which may destroy a Pokemon's usability.
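To put a number on the eyeballed trends, one option is a Spearman rank correlation between each stat and the tier number. The sketch below computes it as a Pearson correlation on ranks, using only the nine preview rows as a stand-in for `df`, so the values it prints say nothing about the full dataset.

```python
import pandas as pd

# Stand-in for df: the nine rows visible in the preview above.
sample = pd.DataFrame({
    "hp":  [80, 78, 79, 60, 35, 90, 90, 90, 160],
    "atk": [82, 84, 83, 95, 55, 120, 72, 65, 105],
    "def": [83, 78, 100, 69, 40, 80, 100, 85, 110],
    "spa": [100, 109, 85, 65, 50, 68, 122, 65, 130],
    "spd": [100, 85, 105, 79, 50, 108, 108, 85, 110],
    "spe": [80, 100, 78, 80, 90, 124, 98, 60, 85],
    "BST": [525, 534, 530, 448, 320, 590, 590, 450, 700],
    "tiernum": [3, 1, 8, 1, 1, 10, 11, 12, 12],
})
stats = ["hp", "atk", "def", "spa", "spd", "spe", "BST"]

# Spearman correlation is the Pearson correlation of the ranked values,
# which suits an ordinal target like tiernum.
ranks = sample.rank()
spearman = ranks[stats].corrwith(ranks["tiernum"]).sort_values(ascending=False)

print(spearman.round(2))
```

Swapping `sample` for the real `df` would give one coefficient per stat to compare against the scatter plots.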

In [5]:
px.scatter(df,x='weightkg',y='tiernum',hover_name='name',trendline='lowess', title = 'Weight vs. Tier', labels={'weightkg':'Weight','tiernum':'Tier Number'})

For weight, I felt that it would be better to log it, since the vast majority of Pokemon have a weight under 200 kg and it's only a select few that reach extremely high numbers.

The correlation is pretty weak, but there is still *some*. Low tier Pokemon can be found across the entire weight range, but all of the Pokemon in the higher tiers are at least somewhat heavy.

Although there *are* moves that depend on a Pokemon's weight for damage calculations, I think it's unlikely that weight is actually determining a Pokemon's viability, given that many heavy Pokemon are still bottom tier.

Rather, I think the opposite is more likely; strong Pokemon just tend to have larger monster designs by default, and then as a result weight-based moves just happen to be viable options for said Pokemon's movesets. In other words, strong Pokemon being heavy is what causes the prevalence of weight-based moves, not the other way around.

In [6]:
px.scatter(df,x='weightkg',y='tiernum',hover_name='name',trendline='lowess',log_x=True, title = 'Weight (logged) vs. Tier', labels={'weightkg':'Weight','tiernum':'Tier Number'})

In [7]:
df['weightlog']=np.log(df['weightkg'])
px.scatter(df,x='weightlog',y='tiernum',hover_name='name',trendline='lowess', title = 'Logged Weight vs. Tier', labels={'weightlog':'Logged Weight','tiernum':'Tier Number'})


Just by eyeballing the graph, some indication of a correlation between generation and viability becomes apparent, particularly in the tiers at the ends. OU and Uber have a noticeably higher number of gen8 and gen9 Pokemon, while gen1 has the largest share of ZU Pokemon.

However, the number of generations makes it difficult for a single histogram to convey much information. Furthermore, while similar in size, not every generation has the same number of Pokemon, so it would be better to chart the generations individually.

In [8]:
tiers = ['AG', 'Uber', 'OU', 'UUBL', 'UU', 'RUBL', 'RU', 'NUBL', 'NU', 'PUBL', 'PU', 'ZUBL','ZU']
tiers = tiers[::-1]

df['genstr'] = df['generation'].astype(str) # this makes px treat generation as a categorical variable with 9 values, rather than a numerical value

px.histogram(df,x='tier',color='genstr',category_orders={'tier':tiers,'genstr': [str(i) for i in range(1, 10)]}, labels={'genstr':'generation'}, title='Smogon Tier Distributions by Generation')

The correlations are much more apparent when looking at each generation separately; gen9, and gen8 to a lesser extent, have MANY more OU and Uber Pokemon relative to the other generations.

Earlier generations have way more ZU Pokemon; close to 2/3 of the currently available gen1 Pokemon are classified in the ZU tier.
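Since the generations differ in size, shares are easier to compare than raw counts. Here is a minimal sketch of a row-normalized crosstab, run on the nine preview rows as a stand-in for `df`; with only gen1 and gen9 present, the printed shares are illustrative and not the real figures.

```python
import pandas as pd

# Stand-in for df: the nine rows visible in the preview above.
sample = pd.DataFrame({
    "generation": [1, 1, 1, 1, 1, 9, 9, 9, 9],
    "tier": ["PU", "ZU", "RUBL", "ZU", "ZU", "UUBL", "OU", "Uber", "Uber"],
})

# normalize="index" turns each generation's row of counts into fractions
# that sum to 1, so generations of different sizes are comparable.
shares = pd.crosstab(sample["generation"], sample["tier"], normalize="index")
zu_share = shares["ZU"]

print(zu_share)
```

Running the same crosstab on the full `df` is one way to check the "close to 2/3" estimate for gen1.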

In [9]:
fig = make_subplots(rows=3, cols=3, subplot_titles=([f'Generation {i}'for i in range(1, 10)]))


for i in range(3):
    for j in range(3):

        # .copy() gives an independent frame, which avoids SettingWithCopyWarning
        temp = df[df['generation'] == 3*i + j + 1].copy()
        # keeping the tiers sorted from lowest (ZU) to highest
        temp['tier'] = pd.Categorical(temp['tier'], categories=tiers, ordered=True)
        temp = temp.sort_values('tier')

        fig.add_trace(go.Histogram(x=temp['tier']), row=i+1, col=j+1)

fig.update_layout(title="Smogon Tier Distributions, Split by Generation", width=2000,height=1250, showlegend=False)



fig.show()



Once again, this combined histogram doesn't give much information; other than ZU, the tiers are too small to really pick out specific types.

In [10]:
px.histogram(df,x='tier',color='type1',category_orders={'tier':tiers}, title='Smogon Tier Distributions by Type')

For this graph, I included any Pokemon with the corresponding type in either column.

The results are more or less as expected: types considered to be 'bad', like grass, normal, rock, and bug, skew towards the ZU/PU end, while steel and fairy, widely accepted as the two best types, both have very strong showings.

Worth pointing out is the dragon typing: it is *good* but no longer *incredible* (the fairy typing was added in gen6 specifically to reduce dragon's strength), yet so many strong 'legendary' Pokemon are dragon types that it ends up looking extremely good on this graph.
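The "either column" grouping can also be summarized numerically. Below is a minimal sketch that melts `type1` and `type2` into one long column and averages the tier number per type, using the nine preview rows as a stand-in for `df`. It assumes, based on the preview, that single-typed Pokemon repeat their type in `type2`, so duplicates are dropped to count each Pokemon once per type.

```python
import pandas as pd

# Stand-in for df: the nine rows visible in the preview above.
sample = pd.DataFrame({
    "name": ["Venusaur", "Charizard", "Blastoise", "Arbok", "Pikachu",
             "Iron Boulder", "Iron Crown", "Terapagos", "Terapagos-Stellar"],
    "type1": ["Grass", "Fire", "Water", "Poison", "Electric",
              "Rock", "Steel", "Normal", "Normal"],
    "type2": ["Poison", "Flying", "Water", "Poison", "Electric",
              "Psychic", "Psychic", "Normal", "Normal"],
    "tiernum": [3, 1, 8, 1, 1, 10, 11, 12, 12],
})

# One row per (Pokemon, type) pair; drop the repeat that single-typed
# Pokemon would otherwise contribute (assumed type2 == type1 convention).
long = (
    sample.melt(id_vars=["name", "tiernum"], value_vars=["type1", "type2"],
                var_name="slot", value_name="type")
          .drop_duplicates(subset=["name", "type"])
)

# Number of Pokemon and mean tier number for each type.
by_type = (
    long.groupby("type")["tiernum"]
        .agg(["count", "mean"])
        .sort_values("mean", ascending=False)
)

print(by_type)
```

A mean tier number treats the tiers as evenly spaced, which is a simplification; the per-type histograms remain the more faithful view.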

In [11]:
types = ["Normal", "Fighting", "Flying", "Poison", "Ground", "Rock", "Bug", "Ghost", "Steel", "Fire", "Water", "Grass", "Electric", "Psychic", "Ice", "Dragon", "Dark", "Fairy"]


fig = make_subplots(rows=3, cols=6, subplot_titles=types)

for i in range(3):
    for j in range(6):

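        # .copy() gives an independent frame, which avoids SettingWithCopyWarning
        temp = df[(df['type1'] == types[6*i + j]) | (df['type2'] == types[6*i + j])].copy()
        # keeping the tiers sorted from lowest (ZU) to highest
        temp['tier'] = pd.Categorical(temp['tier'], categories=tiers, ordered=True)
        temp = temp.sort_values('tier')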

        fig.add_trace(go.Histogram(x=temp['tier'], name = types[6*i+j]), row=i+1, col=j+1)

fig.update_layout(title="Smogon Tier Distributions, Split by Typing", width=1500,height=1000, showlegend=False)



fig.show()




In [12]:
types = ["Normal", "Fighting", "Flying", "Poison", "Ground", "Rock", "Bug", "Ghost", "Steel", "Fire", "Water", "Grass", "Electric", "Psychic", "Ice", "Dragon", "Dark", "Fairy"]


fig = make_subplots(rows=3, cols=6, subplot_titles=types)

for i in range(3):
    for j in range(6):

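        # .copy() gives an independent frame, which avoids SettingWithCopyWarning
        temp = df[df['type1'] == types[6*i + j]].copy()
        # keeping the tiers sorted from lowest (ZU) to highest
        temp['tier'] = pd.Categorical(temp['tier'], categories=tiers, ordered=True)
        temp = temp.sort_values('tier')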

        fig.add_trace(go.Histogram(x=temp['tier'], name = types[6*i+j]), row=i+1, col=j+1)

fig.update_layout(title="Smogon Tier Distributions, Split by Primary Typing", width=2000,height=1250, showlegend=False)



fig.show()




There isn't a particularly big difference when split between primary and secondary typings; good typings are still good, and bad typings are still bad. While not necessarily related to our modeling, it is interesting to notice how flying is almost never the primary type, and bug is never the secondary type.
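The primary versus secondary observation can be checked with plain counts. This is a minimal sketch on the nine preview rows as a stand-in for `df`; it assumes that a `type2` equal to `type1` means "no secondary type", which is what the preview suggests but which I have not verified for the whole file.

```python
import pandas as pd

# Stand-in for df: the nine rows visible in the preview above.
sample = pd.DataFrame({
    "type1": ["Grass", "Fire", "Water", "Poison", "Electric",
              "Rock", "Steel", "Normal", "Normal"],
    "type2": ["Poison", "Flying", "Water", "Poison", "Electric",
              "Psychic", "Psychic", "Normal", "Normal"],
})

# Count how often each type appears in the primary slot.
primary = sample["type1"].value_counts()

# Only count type2 when it differs from type1 (assumed single-type convention).
has_secondary = sample["type1"] != sample["type2"]
secondary = sample.loc[has_secondary, "type2"].value_counts()

# Align both counts on the union of types; missing entries become 0.
slots = pd.DataFrame({"primary": primary, "secondary": secondary}).fillna(0).astype(int)

print(slots)
```

On the full dataset, a type with a `primary` count near zero (as I'd expect for flying) or a `secondary` count of zero (bug) would confirm what the subplots show.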

In [13]:
types = ["Normal", "Fighting", "Flying", "Poison", "Ground", "Rock", "Bug", "Ghost", "Steel", "Fire", "Water", "Grass", "Electric", "Psychic", "Ice", "Dragon", "Dark", "Fairy"]


fig = make_subplots(rows=3, cols=6, subplot_titles=types)

for i in range(3):
    for j in range(6):

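        # .copy() gives an independent frame, which avoids SettingWithCopyWarning
        temp = df[df['type2'] == types[6*i + j]].copy()
        # keeping the tiers sorted from lowest (ZU) to highest
        temp['tier'] = pd.Categorical(temp['tier'], categories=tiers, ordered=True)
        temp = temp.sort_values('tier')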

        fig.add_trace(go.Histogram(x=temp['tier'], name=types[6*i+j]), row=i+1, col=j+1)

fig.update_layout(title="Smogon Tier Distributions, Split by Secondary Typing", width=2000,height=1250, showlegend=False)



fig.show()




In [14]:
# save changes (logged weight); index=False stops another 'Unnamed: 0' column from being written on every save
df.to_csv('cleaned.csv', index=False)