Problem Statement No. 07
Perform the following operations using Python on any open source dataset.
1. Import all the required Python Libraries.
2. Load the Dataset into pandas data frame.
4. Data Preprocessing: Check for missing values in the data and get some initial statistics. Provide variable descriptions,
types of variables, etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data
type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python. In addition to the codes and outputs, explain every
operation that you do in the above steps and explain everything that you do to import/read/scrape the data set.

In [1]:
import pandas as pd
import numpy as np

In [19]:
df = pd.read_csv("nba.csv")

In [4]:
# Display the first few rows of the DataFrame
print(df.head())

  Name,Team,Number,Position,Age,Height,Weight,College,Salary
0  Avery Bradley,Boston Celtics,0,PG,25,2-Jun,180...        
1  Jae Crowder,Boston Celtics,99,SF,25,6-Jun,235,...        
2  John Holland,Boston Celtics,30,SG,27,5-Jun,205...        
3  R.J. Hunter,Boston Celtics,28,SG,22,5-Jun,185,...        
4  Jonas Jerebko,Boston Celtics,8,PF,29,10-Jun,23...        


In [5]:
# Check for missing values via the non-null count of each column, and list each column's data type
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 457 entries, 0 to 456
Data columns (total 1 columns):
 #   Column                                                      Non-Null Count  Dtype 
---  ------                                                      --------------  ----- 
 0   Name,Team,Number,Position,Age,Height,Weight,College,Salary  457 non-null    object
dtypes: object(1)
memory usage: 3.7+ KB
None


In [6]:
# Summary statistics (for object columns: count, unique, top, freq)
print(df.describe())

       Name,Team,Number,Position,Age,Height,Weight,College,Salary
count                                                 457        
unique                                                457        
top     Avery Bradley,Boston Celtics,0,PG,25,2-Jun,180...        
freq                                                    1        
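df.info() and df.describe() report non-null counts, but a direct per-column count of missing values is often easier to read. The block below is a minimal sketch of that check, not part of the original notebook. The `sample` frame is a hypothetical five-row stand-in built from values displayed elsewhere in this notebook, and the names `missing_counts` and `rows_with_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row sample; values copied from rows displayed in this notebook
sample = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland",
             "R.J. Hunter", "Jonas Jerebko"],
    "College": ["Texas", "Marquette", "Boston University",
                "Georgia State", np.nan],
    "Salary": [7730337.0, 6796117.0, np.nan, 1148640.0, 5000000.0],
})

# Number of missing values in each column
missing_counts = sample.isnull().sum()
print(missing_counts)

# Fraction of rows that contain at least one missing value
rows_with_missing = sample.isnull().any(axis=1).mean()
print(rows_with_missing)
```

On the full dataset the same two calls would be applied to df instead of `sample`.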


In [7]:
# Check the dimensions of the DataFrame
print("Dimensions of the DataFrame:", df.shape)

Dimensions of the DataFrame: (457, 1)


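So, in this case, the DataFrame has 457 rows and 1 column. The single column is named "Name,Team,Number,Position,Age,Height,Weight,College,Salary", which indicates that each line of the file was read as one string instead of being split into nine fields. The outputs of In [4] to In [7] reflect that load. After the file is read again in In [19], the DataFrame has nine separate columns, as the cells that follow show.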
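A shape of (457, 1) with a comma-separated column name is a sign that the delimiter was not applied. One possible cause, which is only an assumption here, is that every line of the file is wrapped in double quotes. The sketch below reproduces that situation on a tiny in-memory sample and splits the single column back into fields. The helper `split_single_column` is hypothetical and would break on fields that themselves contain commas.

```python
import io
import pandas as pd

# Hypothetical sample: every line is wrapped in quotes, so pandas sees one field
raw = (
    '"Name,Team,Number"\n'
    '"Avery Bradley,Boston Celtics,0"\n'
    '"Jae Crowder,Boston Celtics,99"\n'
)
bad = pd.read_csv(io.StringIO(raw))
print(bad.shape)  # one column, like the (457, 1) result above


def split_single_column(frame, sep=","):
    """Split a one-column frame into fields, using its header for the names."""
    if frame.shape[1] != 1:
        return frame  # already parsed into several columns
    header = frame.columns[0]
    parts = frame[header].str.split(sep, expand=True)
    parts.columns = header.split(sep)
    return parts


fixed = split_single_column(bad)
# Split values are strings, so numeric columns need an explicit conversion
fixed["Number"] = pd.to_numeric(fixed["Number"])
print(fixed.dtypes)
```

Checking df.shape right after loading, as done above, is the quickest way to catch this kind of parsing problem.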

# Summarize the types of variables and apply data type conversions

We first examine the data type of each column in the DataFrame and then convert any column that is not stored in the correct type. Here's how we can do it:

In [20]:
# Check data types of variables
print(df.dtypes)

Name         object
Team         object
Number        int64
Position     object
Age           int64
Height       object
Weight        int64
College      object
Salary      float64
dtype: object
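To go from the raw dtype listing to a summary of variable types, the columns can be grouped into numeric and non-numeric sets. The following is a small sketch under stated assumptions: `sample` is a hypothetical five-row frame with values copied from this notebook's displayed rows, and treating Position as a pandas `category` is an illustrative choice, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with values copied from the rows shown in this notebook
sample = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland",
             "R.J. Hunter", "Jonas Jerebko"],
    "Position": ["PG", "SF", "SG", "SG", "PF"],
    "Age": [25, 25, 27, 22, 29],
    "Weight": [180, 235, 205, 185, 231],
    "Salary": [7730337.0, 6796117.0, np.nan, 1148640.0, 5000000.0],
})

# Group the columns by kind of variable
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("Numeric:", numeric_cols)
print("Non-numeric:", other_cols)

# A text column with few distinct values can be stored as a categorical type
sample["Position"] = sample["Position"].astype("category")
print(sample.dtypes)
```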


In [22]:
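# Convert Height to a numeric dtype.
# Caution: values such as "2-Jun" are not numeric, so errors='coerce'
# replaces them with NaN (see the Height column in the output further below).
df['Height'] = pd.to_numeric(df['Height'], errors='coerce')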

In [23]:
# Display data types after conversion
print(df.dtypes)

Name         object
Team         object
Number        int64
Position     object
Age           int64
Height      float64
Weight        int64
College      object
Salary      float64
dtype: object
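The Height column becomes float64 here, but only because values such as "2-Jun" cannot be parsed and are coerced to NaN. A possible alternative is sketched below. It rests on an assumption that is not confirmed by the source: that "2-Jun" is a spreadsheet-mangled form of "6-2" (6 feet 2 inches), with the month standing for feet and the day for inches. The helper `height_to_inches` and the mapping `MONTH_TO_FEET` are hypothetical names.

```python
import pandas as pd

# Assumption: the month abbreviation encodes feet (May=5, Jun=6, Jul=7)
MONTH_TO_FEET = {"May": 5, "Jun": 6, "Jul": 7}


def height_to_inches(value):
    """Convert '2-Jun' (assumed 6 ft 2 in) or '6-2' to total inches."""
    if pd.isna(value):
        return float("nan")
    first, _, second = str(value).partition("-")
    if second in MONTH_TO_FEET:
        # Mangled form: inches first, then the month that stands for feet
        feet, inches = MONTH_TO_FEET[second], int(first)
    else:
        # Plain form: feet first, then inches
        feet, inches = int(first), int(second)
    return feet * 12 + inches


# Height strings as displayed by df.head(), plus one missing value
heights = pd.Series(["2-Jun", "6-Jun", "5-Jun", "10-Jun", None])
inches = heights.map(height_to_inches)
print(inches)
```

If the assumption holds, applying the helper to df['Height'] instead of pd.to_numeric would keep the height information rather than discarding it.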


# Turn categorical variables into quantitative variables

In [26]:
# Import the encoder from scikit-learn
from sklearn.preprocessing import OneHotEncoder

# One-hot encode 'Team', treating it as a categorical variable.
# sparse_output=False makes the encoder return a dense NumPy array;
# this parameter replaced the older `sparse` argument in scikit-learn 1.2.
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the 'Team' column
team_encoded = encoder.fit_transform(df[['Team']])

# Create a DataFrame from the encoded array
team_encoded_df = pd.DataFrame(team_encoded, columns=encoder.get_feature_names_out(['Team']))

# Concatenate the encoded DataFrame with the original DataFrame
df_encoded = pd.concat([df, team_encoded_df], axis=1)

# Drop the original 'Team' column
df_encoded.drop(columns=['Team'], inplace=True)

# Displaying the transformed DataFrame
print(df_encoded.head())

            Name  Number Position  Age  Height  Weight            College  \
0  Avery Bradley       0       PG   25     NaN     180              Texas   
1    Jae Crowder      99       SF   25     NaN     235          Marquette   
2   John Holland      30       SG   27     NaN     205  Boston University   
3    R.J. Hunter      28       SG   22     NaN     185      Georgia State   
4  Jonas Jerebko       8       PF   29     NaN     231                NaN   

      Salary  Team_Atlanta Hawks  Team_Boston Celtics  ...  \
0  7730337.0                 0.0                  1.0  ...   
1  6796117.0                 0.0                  1.0  ...   
2        NaN                 0.0                  1.0  ...   
3  1148640.0                 0.0                  1.0  ...   
4  5000000.0                 0.0                  1.0  ...   

   Team_Oklahoma City Thunder  Team_Orlando Magic  Team_Philadelphia 76ers  \
0                         0.0                 0.0                      0.0   
1       
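One-hot encoding with scikit-learn is one way to turn a categorical variable into numbers. As a hedged alternative, the sketch below does the same with pandas alone and also shows integer codes. It is not part of the original notebook: `sample` is a hypothetical five-row frame using the Position values shown above, and the names `encoded` and `position_codes` are illustrative.

```python
import pandas as pd

# Hypothetical sample using the Name and Position values displayed above
sample = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland",
             "R.J. Hunter", "Jonas Jerebko"],
    "Position": ["PG", "SF", "SG", "SG", "PF"],
})

# One-hot encoding with pandas: one 0/1 column per distinct Position value
encoded = pd.get_dummies(sample, columns=["Position"], dtype=int)
print(encoded)

# Integer codes: each category is mapped to its index in sorted order
position_codes = sample["Position"].astype("category").cat.codes
print(position_codes.tolist())
```

One-hot columns avoid implying an order among positions, while integer codes are more compact but introduce an arbitrary ordering that some models may misread.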