2] Regression Analysis: (Any one)



B. Use the diabetes data set from UCI and the Pima Indians Diabetes data set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets
Dataset link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database



The steps below walk through the analysis of the Pima Indians Diabetes dataset and the UCI diabetes dataset in Python. The code is structured as follows:

1. Univariate Analysis - Calculate statistics such as frequency, mean, median, mode, variance, standard deviation, skewness, and kurtosis.
2. Bivariate Analysis - Use both linear regression and logistic regression modeling.
3. Multiple Regression Analysis - Perform multiple regression using both datasets.
4. Comparative Analysis - Compare the results of the analyses between the two datasets.
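For the univariate step, one detail worth knowing is that pandas and SciPy default to different estimators: `Series.var()` and `Series.std()` divide by n - 1, while `scipy.stats.skew` and `scipy.stats.kurtosis` return the biased (population) form unless `bias=False` is passed. The sketch below illustrates the difference on a small made-up sample; the values are illustrative only and are not taken from either dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

# Made-up sample for illustration only; not drawn from either diabetes dataset
sample = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 10.0])

# pandas uses the sample variance (ddof=1); NumPy defaults to the population variance (ddof=0)
var_sample = sample.var()
var_population = np.var(sample.to_numpy())

# SciPy defaults to the biased estimators; bias=False applies the small-sample correction
skew_biased = skew(sample)
skew_unbiased = skew(sample, bias=False)
kurt_biased = kurtosis(sample)                 # excess kurtosis (normal distribution = 0)
kurt_unbiased = kurtosis(sample, bias=False)

print(f"variance  sample={var_sample:.4f}  population={var_population:.4f}")
print(f"skewness  biased={skew_biased:.4f}  corrected={skew_unbiased:.4f}  pandas={sample.skew():.4f}")
print(f"kurtosis  biased={kurt_biased:.4f}  corrected={kurt_unbiased:.4f}  pandas={sample.kurt():.4f}")
```

On a dataset with several hundred rows the two forms differ only slightly, but the distinction matters when comparing numbers produced by different libraries.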
Let's start by importing the datasets and required libraries and proceed through each part.

In [3]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

# Load the dataset
df = pd.read_csv('diabetes.csv')

# Display the first few rows of the dataset
print("Pima Indians Diabetes Dataset:")
print(df.head())

# Part A: Univariate Analysis
def univariate_analysis(df):
    for column in df.columns:
        # Analyse every numeric column, whatever its integer or float width
        if pd.api.types.is_numeric_dtype(df[column]):
            print(f"\nAnalysis of {column}:")
            print(f"Frequency:\n{df[column].value_counts()}")
            print(f"Mean: {df[column].mean()}")
            print(f"Median: {df[column].median()}")
            
            # Use pandas for the mode; it returns every tied value, so take the first
            modes = df[column].mode()
            mode_value = modes.iloc[0] if not modes.empty else "No mode"
            print(f"Mode: {mode_value}")
            
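            # pandas var()/std() use the sample estimator (ddof=1)
            print(f"Variance: {df[column].var()}")
            print(f"Standard Deviation: {df[column].std()}")
            # scipy's skew()/kurtosis() default to the biased (population) estimator;
            # kurtosis() reports excess kurtosis (normal distribution = 0)
            print(f"Skewness: {skew(df[column])}")
            print(f"Kurtosis: {kurtosis(df[column])}")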

# Run univariate analysis
print("\nUnivariate Analysis on Pima Indians Diabetes Dataset")
univariate_analysis(df)



# Part B: Bivariate Analysis (Linear Regression)
# Note: this fits the target on all remaining columns at once, not on a single predictor
def linear_regression_analysis(df, target_column):
    # Use every column except the target as a feature
    features = df.drop(columns=[target_column])
    target = df[target_column]
    
    # Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
    
    # Linear Regression Model
    linear_model = LinearRegression()
    linear_model.fit(X_train, y_train)
    y_pred = linear_model.predict(X_test)
    
    # Model Evaluation
    print(f"\nLinear Regression Analysis on {target_column}")
    print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
    print(f"Model Coefficients: {linear_model.coef_}")
    print(f"Intercept: {linear_model.intercept_}")

# Run linear regression with 'Outcome' as the target variable.
# 'Outcome' is binary (0/1), so this amounts to a linear probability model.
print("\nLinear Regression on Pima Indians Diabetes Dataset")
linear_regression_analysis(df, 'Outcome')

# Bivariate Analysis (Logistic Regression)
def logistic_regression_analysis(df, target_column):
    features = df.drop(columns=[target_column])
    target = df[target_column]
    
    # Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
    
    # Logistic Regression Model
    logistic_model = LogisticRegression(max_iter=1000)
    logistic_model.fit(X_train, y_train)
    y_pred = logistic_model.predict(X_test)
    
    # Model Evaluation
    print(f"\nLogistic Regression Analysis on {target_column}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

# Run logistic regression
print("\nLogistic Regression on Pima Indians Diabetes Dataset")
logistic_regression_analysis(df, 'Outcome')

# Part C: Multiple Regression Analysis
def multiple_regression_analysis(df, target_column):
    features = df.drop(columns=[target_column])
    target = df[target_column]
    
    # Adding a constant term for intercept in statsmodels
    X = sm.add_constant(features)
    
    # Fit the model
    model = sm.OLS(target, X).fit()
    print("\nMultiple Regression Analysis")
    print(model.summary())

# Run multiple regression
print("\nMultiple Regression on Pima Indians Diabetes Dataset")
multiple_regression_analysis(df, 'Outcome')
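
The regression functions in the cell above fit `Outcome` on all remaining columns at once. For a strictly bivariate fit (one predictor, one response), a minimal sketch might look like the following. The helper name `fit_single_predictor` and the toy data are assumptions made for illustration; with the real data one would pass `df` and a column pair such as `'Glucose'` and `'BMI'`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_single_predictor(data, predictor, target):
    """Fit target ~ predictor by ordinary least squares (hypothetical helper)."""
    X = data[[predictor]]          # 2-D frame holding a single column
    y = data[target]
    model = LinearRegression().fit(X, y)
    return {
        "slope": float(model.coef_[0]),
        "intercept": float(model.intercept_),
        "r_squared": float(model.score(X, y)),
    }

# Toy data constructed so that y = 2x + 1 exactly; not taken from either dataset
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 4.0]})
toy["y"] = 2.0 * toy["x"] + 1.0

result = fit_single_predictor(toy, "x", "y")
print(result)
```

Looping this helper over each predictor gives one slope, intercept, and R-squared per variable, which is easier to compare across two datasets than a single multi-feature coefficient vector. The output of the original cell follows.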


Pima Indians Diabetes Dataset:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

Univariate Analysis on Pima Indians Diabetes Dataset

Analysis of Pregnancies:
Frequency:
Pregnancies
1     135
0     111
2     103
3      75
4      68
5      57
6      50
7      45
8      38
9      28
10     24
11     11
13     10
12      9
14      2
15      1
17      1