## Exploratory Data Analysis (EDA)
I checked for missing values with the `isnull().sum()` function. Every column reported zero missing values, so the dataset is complete and no imputation or row removal was needed.

To get an overview of the dataset, I used the `.info()` method, which reports the structure, data types, and non-null counts (28,362 rows, all non-null) and indicates what preprocessing may be needed. I then summarized the numerical features with the `.describe()` method, which returns the count, mean, standard deviation, minimum, maximum, and quartiles for each numerical variable; the median is the 50% row.

### Data Cleaning
I dropped columns that were not needed for the analysis, namely the 'Unnamed: 0' and 'lyrics' columns, using `DataFrame.drop(columns=...)`. This reduced the dataset from 24 to 22 columns and made it easier to read.
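The column removal can be made safe to re-run. The sketch below is one possible variant, not the notebook's exact code: the helper name `drop_unused_columns` and the toy frame are hypothetical, and `errors="ignore"` is an added choice so that running the cell a second time does not raise on columns that are already gone.

```python
import pandas as pd

def drop_unused_columns(df, columns=("Unnamed: 0", "lyrics")):
    """Return a copy of df without the listed columns; absent names are skipped."""
    return df.drop(columns=list(columns), errors="ignore")

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 4],
    "artist_name": ["a", "b"],
    "lyrics": ["la la", "na na"],
    "len": [2, 2],
})

trimmed = drop_unused_columns(toy)
print(list(trimmed.columns))

# Calling it again on the trimmed frame is a no-op rather than a KeyError.
trimmed_again = drop_unused_columns(trimmed)
```

Because `drop` returns a new DataFrame, the original `toy` frame keeps all of its columns.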

The EDA process clarified the dataset's structure, content, and quality, laying the groundwork for further analysis and modeling. The dataset contained 28,362 rows and 24 columns of integer, float, and string (object) types. Columns like 'Unnamed: 0' were removed because they did not provide useful information. The 'release_date' column, which is stored as an integer year, was converted to datetime format for further analysis. Data types and column names were checked for consistency.
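The conversion code is not shown in this section, so the following is only a minimal sketch of how an integer year column could be turned into datetimes. It assumes the column holds four-digit years, as the `.info()` and `.describe()` outputs suggest; the toy frame is illustrative.

```python
import pandas as pd

# Toy frame with integer years, mirroring the int64 'release_date' column.
toy = pd.DataFrame({"release_date": [1950, 1992, 2019]})

# Cast to string first so that format="%Y" parses each value as a year
# instead of interpreting the integer as a nanosecond timestamp.
toy["release_date"] = pd.to_datetime(toy["release_date"].astype(str), format="%Y")

print(toy["release_date"].dtype)
print(toy["release_date"].dt.year.tolist())
```

Each year maps to January 1 of that year, so `.dt.year` recovers the original value when only the year matters.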

In [2]:
import pandas as pd
df = pd.read_csv("../data/train.csv")
df.head(15000)

| | Unnamed: 0 | artist_name | track_name | release_date | genre | lyrics | len | dating | violence | world/life | ... | communication | obscene | music | movement/places | light/visual perceptions | family/spiritual | sadness | feelings | topic | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | mukesh | mohabbat bhi jhoothi | 1950 | pop | hold time feel break feel untrue convince spea... | 95 | 0.000598 | 0.063746 | 0.000598 | ... | 0.263751 | 0.000598 | 0.039288 | 0.000598 | 0.000598 | 0.000598 | 0.380299 | 0.117175 | sadness | 1.0 |
| 1 | 4 | frankie laine | i believe | 1950 | pop | believe drop rain fall grow believe darkest ni... | 51 | 0.035537 | 0.096777 | 0.443435 | ... | 0.001284 | 0.001284 | 0.118034 | 0.001284 | 0.212681 | 0.051124 | 0.001284 | 0.001284 | world/life | 1.0 |
| 2 | 6 | johnnie ray | cry | 1950 | pop | sweetheart send letter goodbye secret feel bet... | 24 | 0.002770 | 0.002770 | 0.002770 | ... | 0.250668 | 0.002770 | 0.323794 | 0.002770 | 0.002770 | 0.002770 | 0.002770 | 0.225422 | music | 1.0 |
| 3 | 10 | pérez prado | patricia | 1950 | pop | kiss lips want stroll charm mambo chacha merin... | 54 | 0.048249 | 0.001548 | 0.001548 | ... | 0.001548 | 0.001548 | 0.001548 | 0.129250 | 0.001548 | 0.001548 | 0.225889 | 0.001548 | romantic | 1.0 |
| 4 | 12 | giorgos papadopoulos | apopse eida oneiro | 1950 | pop | till darling till matter know till dream live ... | 48 | 0.001350 | 0.001350 | 0.417772 | ... | 0.001350 | 0.001350 | 0.001350 | 0.001350 | 0.001350 | 0.029755 | 0.068800 | 0.001350 | romantic | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14995 | 47896 | dr. john | milneburg joys | 1992 | blues | soul milneburg joy soul milneburg joy play mam... | 31 | 0.002770 | 0.002770 | 0.002770 | ... | 0.002770 | 0.002770 | 0.424345 | 0.002770 | 0.002770 | 0.002770 | 0.002770 | 0.108222 | music | 0.4 |
| 14996 | 47897 | dead moon | fire in the western world | 1992 | blues | moan wind blow hard better warn cause time go ... | 67 | 0.001144 | 0.447526 | 0.164416 | ... | 0.092367 | 0.001144 | 0.001144 | 0.179904 | 0.001144 | 0.001144 | 0.001144 | 0.001144 | violence | 0.4 |
| 14997 | 47899 | dr. john | since i fell for you | 1992 | blues | know darlin leave home take go fell bring mise... | 33 | 0.151763 | 0.001645 | 0.001645 | ... | 0.001645 | 0.001645 | 0.144977 | 0.001645 | 0.161957 | 0.001645 | 0.419940 | 0.001645 | sadness | 0.4 |
| 14998 | 47901 | santana | milagro | 1992 | blues | work work work work heal people music free peo... | 22 | 0.002288 | 0.002288 | 0.002288 | ... | 0.002288 | 0.002288 | 0.350114 | 0.002288 | 0.002288 | 0.350114 | 0.002288 | 0.002288 | music | 0.4 |
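An 'Unnamed: 0' column typically appears when a CSV was saved together with its row index and then read back without `index_col`. Assuming that is what happened here (the source does not confirm it), the column could be absorbed at load time instead of dropped afterwards. The sketch below uses an in-memory CSV with a few values from the preview above rather than the real file.

```python
import io
import pandas as pd

# A small CSV whose first header cell is empty, as produced by
# DataFrame.to_csv() when the index is written out.
csv_text = (
    ",artist_name,release_date\n"
    "0,mukesh,1950\n"
    "4,frankie laine,1950\n"
)

# Default read: pandas labels the nameless first column 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 turns that column into the row index instead.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(as_index.columns))
```

Whether to keep those values as an index or discard them depends on whether the original row identifiers are needed later.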


In [30]:
df.isnull().sum()

Unnamed: 0                  0
artist_name                 0
track_name                  0
release_date                0
genre                       0
lyrics                      0
len                         0
dating                      0
violence                    0
world/life                  0
night/time                  0
shake the audience          0
family/gospel               0
romantic                    0
communication               0
obscene                     0
music                       0
movement/places             0
light/visual perceptions    0
family/spiritual            0
sadness                     0
feelings                    0
topic                       0
age                         0
dtype: int64
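With 24 columns, a raw list of counts is easy to misread. As a hedged sketch, the same check can be wrapped in a small report that adds percentages and sorts the worst columns to the top. The helper name `missing_report` and the toy frame with deliberately missing values are assumptions for illustration; the real dataset has no missing values.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Per-column count and percentage of missing values, largest first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report.sort_values("missing", ascending=False)

# Toy frame with gaps injected on purpose (not taken from the dataset).
toy = pd.DataFrame({
    "genre": ["pop", None, "blues", "pop"],
    "len": [95, 51, np.nan, 24],
    "age": [1.0, 1.0, 0.4, 0.4],
})

report = missing_report(toy)
print(report)
```

On the real data, `missing_report(df)["missing"].sum() == 0` would confirm in one line what the output above shows.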

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28362 entries, 0 to 28361
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                28362 non-null  int64  
 1   artist_name               28362 non-null  object 
 2   track_name                28362 non-null  object 
 3   release_date              28362 non-null  int64  
 4   genre                     28362 non-null  object 
 5   lyrics                    28362 non-null  object 
 6   len                       28362 non-null  int64  
 7   dating                    28362 non-null  float64
 8   violence                  28362 non-null  float64
 9   world/life                28362 non-null  float64
 10  night/time                28362 non-null  float64
 11  shake the audience        28362 non-null  float64
 12  family/gospel             28362 non-null  float64
 13  romantic                  28362 non-null  float64
 14  communication             28362 non-null  float64

In [22]:
df.describe()

| | Unnamed: 0 | release_date | len | dating | violence | world/life | night/time | shake the audience | family/gospel | romantic | communication | obscene | music | movement/places | light/visual perceptions | family/spiritual | sadness | feelings | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 | 28362.0 |
| mean | 42948.166878 | 1990.239652 | 73.030534 | 0.02111 | 0.118371 | 0.120984 | 0.057356 | 0.017418 | 0.017045 | 0.048676 | 0.076651 | 0.097185 | 0.060067 | 0.047417 | 0.049008 | 0.024155 | 0.129402 | 0.030995 | 0.425148 |
| std | 24747.811462 | 18.486997 | 41.831605 | 0.052366 | 0.178658 | 0.172216 | 0.111892 | 0.040658 | 0.041968 | 0.106071 | 0.109497 | 0.181314 | 0.123346 | 0.091559 | 0.089553 | 0.051032 | 0.181149 | 0.071656 | 0.2641 |
| min | 0.0 | 1950.0 | 1.0 | 0.000291 | 0.000284 | 0.000291 | 0.000289 | 0.000284 | 0.000289 | 0.000284 | 0.000291 | 0.000289 | 0.000289 | 0.000284 | 0.000284 | 0.000284 | 0.000284 | 0.000289 | 0.014286 |
| 25% | 20393.5 | 1975.0 | 42.0 | 0.000923 | 0.00112 | 0.00117 | 0.001032 | 0.000993 | 0.000923 | 0.000975 | 0.001144 | 0.001053 | 0.000975 | 0.000993 | 0.000993 | 0.000957 | 0.001144 | 0.000993 | 0.185714 |
| 50% | 45407.0 | 1991.0 | 63.0 | 0.001462 | 0.002506 | 0.006579 | 0.001949 | 0.001595 | 0.001504 | 0.001754 | 0.002632 | 0.001815 | 0.001815 | 0.001645 | 0.001815 | 0.001645 | 0.005263 | 0.001754 | 0.414286 |
| 75% | 64089.5 | 2007.0 | 93.0 | 0.004049 | 0.192538 | 0.197854 | 0.065778 | 0.009989 | 0.004785 | 0.042304 | 0.132111 | 0.088799 | 0.055109 | 0.054373 | 0.064302 | 0.025515 | 0.235115 | 0.032617 | 0.642857 |
| max | 82451.0 | 2019.0 | 199.0 | 0.647706 | 0.981781 | 0.962105 | 0.973684 | 0.497463 | 0.545303 | 0.940789 | 0.645829 | 0.992298 | 0.956938 | 0.638021 | 0.667782 | 0.618073 | 0.981424 | 0.95881 | 1.0 |
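In the summary above, several topic scores have a mean far above their median (for example, 'violence' has a mean of 0.118371 against a 50% value of 0.002506), which points to right-skewed distributions. A minimal sketch for surfacing such gaps follows; the helper name `mean_median_gap` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def mean_median_gap(df):
    """Mean, median, and their difference for every numeric column."""
    summary = df.select_dtypes(include="number").describe().T
    return pd.DataFrame({
        "mean": summary["mean"],
        "median": summary["50%"],  # describe() reports the median as '50%'
        "mean_minus_median": summary["mean"] - summary["50%"],
    })

# Toy frame: one skewed score column, one symmetric column, one text column.
toy = pd.DataFrame({
    "violence": [0.001, 0.002, 0.003, 0.9],
    "len": [40, 60, 80, 100],
    "genre": ["pop", "pop", "blues", "blues"],
})

gaps = mean_median_gap(toy)
print(gaps)
```

A large positive gap suggests that a few high values dominate the mean, which may matter when choosing scaling or transformation steps before modeling.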


In [31]:
def clean_artist_data(df):
    """Return a copy of df without the 'Unnamed: 0' and 'lyrics' columns."""
    return df.drop(columns=['Unnamed: 0', 'lyrics'])

cleaned_df = clean_artist_data(df)
cleaned_df.to_csv('cleaned_data.csv', index=False)
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28362 entries, 0 to 28361
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   artist_name               28362 non-null  object 
 1   track_name                28362 non-null  object 
 2   release_date              28362 non-null  int64  
 3   genre                     28362 non-null  object 
 4   len                       28362 non-null  int64  
 5   dating                    28362 non-null  float64
 6   violence                  28362 non-null  float64
 7   world/life                28362 non-null  float64
 8   night/time                28362 non-null  float64
 9   shake the audience        28362 non-null  float64
 10  family/gospel             28362 non-null  float64
 11  romantic                  28362 non-null  float64
 12  communication             28362 non-null  float64
 13  obscene                   28362 non-null  float64
 14  music                     28362 non-null  float64