# <span style="color: rgb(232 121 249); text-align: center;"> <center> Genre Predict </center> </span>

<span style="color: rgb(248 113 113); font-size: 18px;" >Problem:</span> We have an online music store. When users sign up, we ask for their age and gender, and based on their profile we recommend various music albums they're likely to buy. In this project we want to use machine learning to increase sales.

So we want to build a model. We feed this model some sample data based on the existing users. The model learns the patterns in that data, so we can ask it to make predictions. When a new user signs up, we give the model that user's profile and ask what kind of music the user is interested in. The model will answer jazz, hip hop, or some other genre, and based on that answer we can make suggestions to the user. This is the problem we're going to solve.


<span style="color: rgb(52 211 153); font-size: 17px;" >Steps:</span>
    
    1. Inspecting and preparing the data for the model.
    2. Building the model for learning and predicting.
    3. Calculating the accuracy.
    4. Persisting models.
    5. Visualizing the model.

In [2]:
# Import necessary module.
import pandas as pd 

In [3]:
# Load data to DataFrame: music_data
music_data = pd.read_csv("music.csv")

## <span style="text-align: center;"> <center> Step 1: Inspecting and Preparing the Data </center> </span>

In [4]:
# Data inspection
music_data

Unnamed: 0,age,gender,genre
0,20,Male,HipHop
1,23,Male,HipHop
2,25,Male,HipHop
3,26,Male,Jazz
4,29,Male,Jazz
5,30,Male,Jazz
6,31,Male,Classical
7,33,Male,Classical
8,37,Male,Classical
9,20,Female,Dance
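
To follow along without the `music.csv` file, the ten rows displayed above can be rebuilt in memory. This is a minimal sketch: `sample_csv` and `sample_data` are names introduced here for illustration, and the result is only a partial sample of the data, not the whole file.

```python
import io
import pandas as pd

# The ten rows shown in the output above, written as CSV text.
# This is only a partial sample of the original file.
sample_csv = """age,gender,genre
20,Male,HipHop
23,Male,HipHop
25,Male,HipHop
26,Male,Jazz
29,Male,Jazz
30,Male,Jazz
31,Male,Classical
33,Male,Classical
37,Male,Classical
20,Female,Dance
"""

# pd.read_csv accepts any file-like object, so StringIO stands in for the file.
sample_data = pd.read_csv(io.StringIO(sample_csv))
print(sample_data.shape)
```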


In [5]:
# Display the number of rows and columns
print("Data dimensions:", music_data.shape)
print("Number of columns are:", music_data.shape[1])
print("Number of rows are: ", music_data.shape[0], end='\n\n')

Data dimensions: (18, 3)
Number of columns are: 3
Number of rows are:  18



### <span style="text-align: center;"> <center> Cleaning data (Preparing the data) </center> </span>

In [16]:
# Work on a copy so that changes to it do not affect the original data
clean_data = music_data.copy()

# Inspect the summary of data
clean_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   age     18 non-null     int64 
 1   gender  18 non-null     object
 2   genre   18 non-null     object
dtypes: int64(1), object(2)
memory usage: 560.0+ bytes


In [17]:
# Check null or missing values
clean_data.isnull().sum()

age       0
gender    0
genre     0
dtype: int64

There are no null or missing values in any column.

In [20]:
# Check for duplicate rows
clean_data.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
dtype: bool

There are no duplicate rows.

In [21]:
# Display information about the music data.
music_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   age     18 non-null     int64 
 1   gender  18 non-null     object
 2   genre   18 non-null     object
dtypes: int64(1), object(2)
memory usage: 560.0+ bytes


The music data has no missing values and no duplicate rows, so we only need to convert the object columns into numbers.
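
Both checks can be condensed into single numbers, which is easier to read than a long boolean listing. The sketch below uses a small hypothetical frame (`demo`, not the music data) that deliberately contains one missing value and one repeated row, so that each count is non-zero.

```python
import pandas as pd

# Hypothetical frame with one missing value and one repeated row.
demo = pd.DataFrame({
    "age": [20, 23, 23, None],
    "gender": ["Male", "Male", "Male", "Female"],
})

missing_total = int(demo.isnull().sum().sum())  # missing cells in the whole frame
duplicate_rows = int(demo.duplicated().sum())   # rows that repeat an earlier row

print(missing_total, duplicate_rows)

# Typical clean-up when either count is non-zero.
tidy = demo.dropna().drop_duplicates().reset_index(drop=True)
print(tidy)
```

For the music data both counts are zero, so no clean-up of this kind is needed.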

From the data summary above we can see that the age column is an integer, while the gender and genre columns are objects.

**Gender is a Discrete (Categorical) variable**

**Genre is the Dependent (Categorical) variable**

**Note:** Most machine learning models only accept numerical variables, so preprocessing the categorical variables is a necessary step. We need to convert these categorical variables to numbers so that the model can understand them and extract valuable information.


> <span style="color: rgb(21 94 117); font-size: 16px;"> How we select an encoding method depends on the algorithm(s) we apply: </span>

- <span style="color: rgb(3 105 161);"> Some algorithms can work with categorical data directly, e.g. LightGBM and CatBoost. A decision tree, for example, can be learned directly from categorical data with no data transform required (this depends on the specific implementation). </span>

- <span style="color: rgb(3 105 161);"> Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric. </span> 

- <span style="color: rgb(3 105 161);"> Some implementations of machine learning algorithms require all data to be numerical. For example, scikit-learn has this requirement. </span>

- <span style="color: rgb(3 105 161);"> If we divide algorithms into linear and tree-based models, we should keep in mind that linear models are generally sensitive to the order of ordinal data, so we should select an appropriate encoding method. </span>

__[An Overview of Categorical Encoding Methods](https://www.kaggle.com/code/arashnic/an-overview-of-categorical-encoding-methods/notebook)__
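
To make the difference between encoding methods concrete, here is a minimal sketch that contrasts integer (label) encoding with one-hot encoding using plain pandas. The `genres` series is a hypothetical sample, and `pd.factorize` and `pd.get_dummies` are shown as one possible way to do this, not as the method used later in this notebook.

```python
import pandas as pd

# Hypothetical sample: one categorical feature with three levels.
genres = pd.Series(["HipHop", "Jazz", "Classical", "Jazz"], name="genre")

# Integer (label) encoding: each category gets a code in order of first appearance.
codes, uniques = pd.factorize(genres)
print(codes)          # one integer per row
print(list(uniques))  # the category behind each integer

# One-hot encoding: one 0/1 column per category, so no order is implied.
one_hot = pd.get_dummies(genres, dtype=int)
print(one_hot)
```

Integer codes suggest an ordering (here Classical > Jazz > HipHop) that does not really exist. Tree-based models usually cope with this, while a linear model may read it as a real magnitude, which is where one-hot encoding tends to be the safer choice.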

In [22]:
# Build a list of the names of the object-typed columns in the music data
categorical_columns = [c for c in clean_data.columns if clean_data[c].dtypes == 'object']

categorical_columns

['gender', 'genre']
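
The same list can also be obtained with `DataFrame.select_dtypes`, which avoids comparing dtype names by hand. This is a sketch on a hypothetical frame (`demo`) with the same column layout as the music data; excluding numeric columns is an assumption that fits this dataset, where every non-numeric column is categorical.

```python
import pandas as pd

# Hypothetical frame with the same column layout as the music data.
demo = pd.DataFrame({
    "age": [20, 26],
    "gender": ["Male", "Female"],
    "genre": ["HipHop", "Jazz"],
})

# Keep every column that is not numeric.
categorical_columns = demo.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)
```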

In [32]:
# Define function: convert an object column to numeric codes
def objToNum(col):
    '''
    Map each unique value of the column to an integer index (in order
    of first appearance) and replace the string (object) values in
    clean_data[col] with those integers. Returns the mapping used.
    '''
    possible_labels = clean_data[col].unique()
    label_dict = {label: index for index, label in enumerate(possible_labels)}

    # map() swaps each string for its integer code, giving an integer column
    clean_data[col] = clean_data[col].map(label_dict)

    return label_dict

In [33]:
# Convert object columns to numeric
for i in categorical_columns:
    print(objToNum(i))
    

{'Male': 0, 'Female': 1}
{'HipHop': 0, 'Jazz': 1, 'Classical': 2, 'Dance': 3, 'Acoustic': 4}


Now 0 represents `Male` and 1 represents `Female`. In the genre column, `HipHop`, `Jazz`, `Classical`, `Dance`, and `Acoustic` are represented by 0, 1, 2, 3, and 4, respectively.
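
Because the model will later output these integer codes rather than genre names, it helps to keep the mapping and invert it when reading predictions. A minimal sketch, assuming the genre mapping printed above; `genre_to_code`, `code_to_genre`, and `predicted_codes` are illustrative names, and the predicted values are made up.

```python
import pandas as pd

# Mapping printed by objToNum for the genre column.
genre_to_code = {"HipHop": 0, "Jazz": 1, "Classical": 2, "Dance": 3, "Acoustic": 4}

# Invert it so integer codes can be turned back into names.
code_to_genre = {code: name for name, code in genre_to_code.items()}

# Hypothetical model output: a series of predicted genre codes.
predicted_codes = pd.Series([1, 0, 4])
predicted_genres = predicted_codes.map(code_to_genre)
print(predicted_genres.tolist())
```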

In [35]:
clean_data

Unnamed: 0,age,gender,genre
0,20,0,0
1,23,0,0
2,25,0,0
3,26,0,1
4,29,0,1
5,30,0,1
6,31,0,2
7,33,0,2
8,37,0,2
9,20,1,3


In [36]:
clean_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     18 non-null     int64
 1   gender  18 non-null     int64
 2   genre   18 non-null     int64
dtypes: int64(3)
memory usage: 560.0 bytes


All values are now numeric. Next we separate the dataset into the input features (X) and the target (y) that the model will learn from and predict.

In [40]:
# Create the input features (X): every column except genre
X = clean_data.drop(columns=['genre'])

# Create the target (y): only the genre column
y = clean_data['genre']
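
Note that `X` and `y` separate features from the target; they are not yet a training set and a test set. One way to hold out rows for testing is sketched below with plain pandas, using the ten encoded rows shown earlier as stand-in data (`demo`). The 80/20 ratio and `random_state=0` are arbitrary assumptions; scikit-learn's `train_test_split` is the more common tool for this step.

```python
import pandas as pd

# The ten encoded rows displayed earlier, used here as stand-in data.
demo = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "genre":  [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
})

X = demo.drop(columns=["genre"])
y = demo["genre"]

# Randomly pick 80% of the rows for training; the rest form the test set.
X_train = X.sample(frac=0.8, random_state=0)
X_test = X.drop(index=X_train.index)

# Select the matching targets by index label.
y_train = y.loc[X_train.index]
y_test = y.loc[X_test.index]

print(len(X_train), len(X_test))
```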
