[Handling Machine Learning Categorical Data with Python Tutorial](https://www.datacamp.com/tutorial/categorical-data)

In [24]:


import numpy as np
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv")


In [8]:
data.head()

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.1,Ideal,H,SI1,VG,EX,GIA,5169
1,0.83,Ideal,H,VS1,ID,ID,AGSL,3470
2,0.85,Ideal,H,SI1,EX,EX,GIA,3183
3,0.91,Ideal,E,SI1,VG,VG,GIA,4370
4,0.83,Ideal,G,SI1,EX,EX,GIA,3171


In [7]:
data.dtypes

Carat Weight    float64
Cut              object
Color            object
Clarity          object
Polish           object
Symmetry         object
Report           object
Price             int64
dtype: object

In [12]:
data['Cut'].value_counts()

Ideal              2482
Very Good          2428
Good                708
Signature-Ideal     253
Fair                129
Name: Cut, dtype: int64

In [13]:
import plotly.express as px
cut_counts = data['Cut'].value_counts()
fig = px.bar(x=cut_counts.index, y=cut_counts.values)
fig.show()


In [16]:
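# numeric_only=True skips the text columns; without it, pandas 2.x raises a TypeError
data.groupby("Cut").mean(numeric_only=True)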

Cut,Carat Weight,Price
Fair,1.058682,5886.178295
Good,1.268927,9326.65678
Ideal,1.382293,13127.331185
Signature-Ideal,1.205217,11541.525692
Very Good,1.332941,11484.69687


In [21]:
pd.crosstab(index=data['Cut'], columns=data['Color'])

Cut \ Color,D,E,F,G,H,I
Fair,12,32,24,21,24,16
Good,74,110,133,148,128,115
Ideal,280,278,363,690,458,413
Signature-Ideal,30,35,38,64,45,41
Very Good,265,323,455,578,424,383


In [27]:
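# Passing the aggregation by name avoids the deprecation warning that np.mean triggers in recent pandas
pd.pivot_table(data, values='Price', index='Cut', columns='Color', aggfunc='mean')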

Cut \ Color,D,E,F,G,H,I
Fair,6058.25,5370.625,6063.625,7345.52381,5908.5,4573.1875
Good,10058.716216,8969.545455,9274.007519,9988.614865,9535.132812,8174.113043
Ideal,18461.953571,12647.107914,14729.426997,13570.310145,11527.700873,9459.588378
Signature-Ideal,19823.1,11261.914286,13247.947368,10248.296875,9112.688889,8823.463415
Very Good,13218.826415,12101.910217,12413.905495,12354.013841,10056.106132,8930.031332


**Converting categorical data to numeric by creating dummy variables with pandas**

In [39]:
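# Small example frame; named fruit_data so it does not overwrite the diamond DataFrame `data`
fruit_data = {
    "fruit": ["apple", "banana", "orange", "apple"]
}

df = pd.DataFrame(fruit_data)
# show head
df.head()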

Unnamed: 0,fruit
0,apple
1,banana
2,orange
3,apple


In [40]:
# dtype=int keeps the 0/1 output shown below; pandas 2.x returns True/False by default
df_encoded = pd.get_dummies(df["fruit"], dtype=int)
df_encoded.head()

Unnamed: 0,apple,banana,orange
0,1,0,0
1,0,1,0
2,0,0,1
3,1,0,0
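
One thing worth keeping in mind with `pd.get_dummies` is that the output columns depend on whichever values happen to be present in the frame being encoded. The sketch below illustrates this with a second, made-up frame (`new`, containing a hypothetical unseen value `"kiwi"`; neither is part of the original notebook) and shows one possible way to align its columns with the first frame using `reindex`.

```python
import pandas as pd

# Original example frame plus a hypothetical "new" frame with an unseen category
train = pd.DataFrame({"fruit": ["apple", "banana", "orange", "apple"]})
new = pd.DataFrame({"fruit": ["banana", "kiwi"]})  # "kiwi" is an illustrative assumption

train_dummies = pd.get_dummies(train["fruit"], dtype=int)
new_dummies = pd.get_dummies(new["fruit"], dtype=int)

# The two encodings do not share the same columns
print(list(train_dummies.columns))  # ['apple', 'banana', 'orange']
print(list(new_dummies.columns))    # ['banana', 'kiwi']

# Align the new frame to the training columns: missing columns are filled
# with 0 and columns that were never seen in training ("kiwi") are dropped
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned)
```

After alignment, the `"kiwi"` row is all zeros, because none of the training categories apply to it.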


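**For machine learning, it is usually better to create the dummy variables with scikit-learn, because a fitted encoder can be reused to transform new data in the same way.**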

In [43]:
# One-hot encode the fruit column using scikit-learn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded_results = encoder.fit_transform(df).toarray()
encoded_results

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
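
As a rough illustration of why a fitted encoder is convenient, the sketch below fits `OneHotEncoder` once and then applies it to a separate frame. The frame `new` and the unseen value `"kiwi"` are made up for this example, and `handle_unknown="ignore"` is one possible setting, chosen here so that unseen categories are encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same fruit example as above, plus a hypothetical frame of new observations
train = pd.DataFrame({"fruit": ["apple", "banana", "orange", "apple"]})
new = pd.DataFrame({"fruit": ["banana", "kiwi"]})  # "kiwi" is an illustrative assumption

# Fit once on the training data; categories are learned and stored on the encoder
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Reuse the fitted encoder: the output always has one column per learned category
new_encoded = encoder.transform(new).toarray()

print(encoder.categories_[0])  # categories learned from the training frame
print(new_encoded)
```

The column layout is fixed by the `fit` call, so training data and later data are encoded consistently; the `"kiwi"` row comes out as all zeros under this setting.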