## Categorical Data

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values.
Examples: gender, social class, blood type, country.

### Object Creation

#### Series Creation

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
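
As a small sketch of what the `category` dtype stores, the example below rebuilds the same Series and inspects it through the `.cat` accessor. Each distinct value is kept once in `categories`, and every row holds an integer code that points into that list.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct values are stored once, as the categories.
categories = list(s.cat.categories)
print(categories)  # ['a', 'b', 'c']

# Each row is stored as an integer code indexing into the categories.
codes = s.cat.codes.tolist()
print(codes)  # [0, 1, 2, 0]
```

Because repeated values share one entry in `categories`, this representation tends to save memory when a column has few distinct values relative to its length.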

In [4]:
# Convert an existing Series or column to the category dtype
df = pd.DataFrame(
    {
        "A":["a","b","c","a"]
    }
)
df["B"]=df["A"].astype("category")
df

|   | A | B |
|---|---|---|
| 0 | a | a |
| 1 | b | b |
| 2 | c | c |
| 3 | a | a |
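
Columns `A` and `B` display identically, so the conversion is easy to miss. A minimal check, assuming the same frame as above, is to look at the dtypes directly:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype("category")

# The values look the same, but only B uses the category dtype.
print(df.dtypes)

a_is_cat = isinstance(df["A"].dtype, pd.CategoricalDtype)
b_is_cat = isinstance(df["B"].dtype, pd.CategoricalDtype)
print(a_is_cat, b_is_cat)  # False True
```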


**Categoricals can also be created with special functions such as cut(), which groups data into discrete bins.**

In [12]:
df = pd.DataFrame(
    {
        "value":np.random.randint(0,100,10)
    }
)
df

|   | value |
|---|-------|
| 0 | 37 |
| 1 | 11 |
| 2 | 43 |
| 3 | 76 |
| 4 | 8 |
| 5 | 21 |
| 6 | 17 |
| 7 | 88 |
| 8 | 45 |
| 9 | 48 |


In [15]:
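# One label per bin: "0 - 9", "10 - 19", ..., "90 - 99"
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
# right=False makes each bin closed on the left, e.g. [10, 20)
df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head()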

|   | value | group |
|---|-------|-------|
| 0 | 37 | 30 - 39 |
| 1 | 11 | 10 - 19 |
| 2 | 43 | 40 - 49 |
| 3 | 76 | 70 - 79 |
| 4 | 8 | 0 - 9 |
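
The `right=False` argument matters for these labels. The sketch below uses a few hand-picked boundary values (chosen here for illustration, not taken from the random data above) to compare left-closed bins with the default right-closed bins.

```python
import pandas as pd

edges = list(range(0, 105, 10))  # 0, 10, ..., 100
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

# Illustrative values sitting on or next to bin edges.
values = pd.Series([9, 10, 19, 20])

# right=False: bins are [0, 10), [10, 20), ... so 10 lands in "10 - 19".
left_closed = pd.cut(values, edges, right=False, labels=labels).tolist()
print(left_closed)  # ['0 - 9', '10 - 19', '10 - 19', '20 - 29']

# Default right=True: bins are (0, 10], (10, 20], ... so 10 lands in "0 - 9",
# which contradicts the label text.
right_closed = pd.cut(values, edges, labels=labels).tolist()
print(right_closed)  # ['0 - 9', '0 - 9', '10 - 19', '10 - 19']
```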


### Introduction to Tiling and the cut() Method

The **cut()** function computes a grouping for the values of the input array and is often used to transform continuous variables into discrete or categorical variables.

In [16]:
ages = np.array([10,15,13,12,23,25,28,59,60])
pd.cut(ages,bins=3)

[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60.0], (43.333, 60.0]]
Categories (3, interval[float64]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]

If the bins keyword is an integer, equal-width bins are formed. Alternatively, we can specify custom bin edges.

In [17]:
c=pd.cut(ages,bins=[0,18,35,70])
c

[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]
Categories (3, interval[int64]): [(0, 18] < (18, 35] < (35, 70]]
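
When `bins` is an integer, the edges pandas computes can be retrieved with `retbins=True`. The sketch below reuses the `ages` array. The lowest edge comes out slightly below the minimum because pandas extends the range a little so that the smallest value falls inside the first right-closed bin.

```python
import numpy as np
import pandas as pd

ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])

# retbins=True returns the computed bin edges alongside the binned data.
binned, edges = pd.cut(ages, bins=3, retbins=True)
print(edges)

# Three bins need four edges; the interior bins have equal width.
widths = np.diff(edges)
print(widths)
```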

If the bins keyword is an IntervalIndex, its intervals are used to bin the passed data.

In [18]:
pd.cut([25,20,50],bins=c.categories)

[(18, 35], (18, 35], (35, 70]]
Categories (3, interval[int64]): [(0, 18] < (18, 35] < (35, 70]]
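
An IntervalIndex can also be built directly instead of being taken from an earlier result. The sketch below uses `pd.IntervalIndex.from_breaks` with the same edges as above and adds one value (80, chosen here for illustration) that lies outside every interval, which shows that such values become NaN.

```python
import pandas as pd

# Intervals (0, 18], (18, 35], (35, 70]; closed on the right by default.
bins = pd.IntervalIndex.from_breaks([0, 18, 35, 70])

result = pd.cut([25, 20, 50, 80], bins=bins)
print(result)

# 80 is outside every interval, so it is missing in the result.
missing = result.isna().tolist()
print(missing)  # [False, False, False, True]
```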

In [22]:
raw_cat = pd.Categorical(
    ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
)
raw_cat

[NaN, 'b', 'c', NaN]
Categories (3, object): ['b', 'c', 'd']
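
In the output above, `"a"` is not among the declared categories, so both occurrences become NaN, while `"d"` is a valid category that never appears in the data. A short sketch of how this shows up in the codes and in a Series built from `raw_cat`:

```python
import pandas as pd

raw_cat = pd.Categorical(
    ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
)

# Values outside the declared categories get the code -1 (missing).
codes = raw_cat.codes.tolist()
print(codes)  # [-1, 0, 1, -1]

# A Categorical can be wrapped in a Series like any other array.
s = pd.Series(raw_cat)
n_missing = int(s.isna().sum())
print(n_missing)  # 2

# value_counts() reports unused categories with a count of 0.
counts = s.value_counts().to_dict()
print(counts)
```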