# 1. Supervised Learning

**Importing libraries and data**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

data = pd.read_excel("Concrete_Data.xls")

**Split data into training and testing sets**

In [None]:
# Split the data into features (X) and target variable (y)
X = data.iloc[:, :-1]  # All columns except the last one
y = data.iloc[:, -1]  # Last column

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Linear Regression Model**

In [None]:
# Creating a linear regression model
lin_reg = LinearRegression()

# Training the model on the training data
lin_reg.fit(X_train, y_train)

# Making predictions on the testing data
y_pred = lin_reg.predict(X_test)

# Evaluating the model performance
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Printing the results
print("Linear Regression - Mean Absolute Error:", mae)
print("Linear Regression - R-squared:", r2)

Linear Regression - Mean Absolute Error: 7.745392872421345
Linear Regression - R-squared: 0.627541605542902


**Polynomial Regression Model**

In [None]:
# Creating a polynomial features object
poly = PolynomialFeatures(degree=2)

# Transforming the training and testing data
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Creating a polynomial regression model
poly_reg = LinearRegression()

# Training the model on the training data
poly_reg.fit(X_train_poly, y_train)

# Making predictions on the testing data
y_pred_poly = poly_reg.predict(X_test_poly)

# Evaluating the model performance
mae_poly = mean_absolute_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)

# Printing the results
print("\nPolynomial Regression - Mean Absolute Error:", mae_poly)
print("Polynomial Regression - R-squared:", r2_poly)

# Comparing models on test-set MAE (lower is better)
if mae_poly < mae:
    print("Polynomial Regression performs better than Linear Regression")
else:
    print("Linear Regression performs better than Polynomial Regression")


Polynomial Regression - Mean Absolute Error: 5.969643801920421
Polynomial Regression - R-squared: 0.7842685049729758
Polynomial Regression performs better than Linear Regression


**Statistical Analysis**

In [None]:
# Selecting the numerical columns (this returns a DataFrame, not a list of names)
numerical_cols = data.select_dtypes(include=[np.number])

# Printing the mean, variance, and standard deviation for the first two numerical columns
# (pandas computes the sample variance and standard deviation, i.e. ddof=1)
for col in numerical_cols.columns[:2]:
    print(f"\nFor column {col}:")
    print("Mean:", data[col].mean())
    print("Variance:", data[col].var())
    print("Standard Deviation:", data[col].std())


For column Cement (component 1)(kg in a m^3 mixture):
Mean: 281.16563106796116
Variance: 10921.742654363268
Standard Deviation: 104.5071416428718

For column Blast Furnace Slag (component 2)(kg in a m^3 mixture):
Mean: 73.89548543689321
Variance: 7444.083725468689
Standard Deviation: 86.27910364316895


# 2. Open-Ended Questions for Analysis and Interpretation

**Understanding the Relationship in the Data**


*Relationship:* In my dataset, there seemed to be an approximately linear, mostly positive relationship between the independent variables (cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, age) and the dependent variable (concrete compressive strength).

*Model Performance:* The linear regression model captured this relationship reasonably well, with a test-set R² of about 0.63 and an MAE of about 7.75. However, the model underestimated or overestimated the compressive strength in some instances, suggesting that a more complex model, such as polynomial regression, might capture the non-linear aspects of the relationship more accurately.
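One way to check the direction of each feature's relationship with the target is to look at per-feature correlations. The following is a minimal sketch on synthetic data: the frame `demo`, its column names, and the toy relationship are assumptions made for illustration, and with the real notebook the `data` frame from the first cell would take its place.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the concrete data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "cement": rng.uniform(100, 500, n),
    "water": rng.uniform(120, 250, n),
})
# Assumed toy relationship: strength rises with cement and falls with water.
demo["strength"] = (
    40 + 0.08 * demo["cement"] - 0.15 * demo["water"] + rng.normal(0, 3, n)
)

# Pearson correlation of every feature with the target (the last column).
target = demo.columns[-1]
corr_with_target = demo.corr()[target].drop(target).sort_values()
print(corr_with_target)
```

A correlation near zero or with an unexpected sign for a feature would be a hint that a single "positive linear" description does not fit every variable.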

**Interpreting Model Coefficients**


*Coefficient Meaning:* The coefficients in the linear regression model represent the change in the dependent variable (concrete compressive strength) for a unit change in the corresponding independent variable. For example, a positive coefficient for cement indicates that an increase in cement content is associated with an increase in compressive strength.

*Slope and Intercept:* The slope of the regression line (coefficient) determines the steepness of the relationship between the variables. A steeper slope means that a small change in the independent variable leads to a larger change in the dependent variable. The intercept represents the predicted value of the dependent variable when all independent variables are zero. While it might not have a direct physical interpretation in this context, it provides a baseline for the model's predictions.
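The coefficients and intercept discussed above can be read directly from a fitted `LinearRegression` through its `coef_` and `intercept_` attributes. This is a minimal sketch on synthetic data: `X_demo`, `y_demo`, and the ground-truth values (intercept 5, slopes 0.10 and 0.03) are assumptions chosen so the recovered values are easy to check. In the notebook, `lin_reg` and `X_train.columns` would be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features with hypothetical names.
rng = np.random.default_rng(1)
n = 300
X_demo = pd.DataFrame({
    "cement": rng.uniform(100, 500, n),
    "age": rng.uniform(1, 365, n),
})
# Assumed ground truth: intercept 5, slopes 0.10 and 0.03, plus unit-variance noise.
y_demo = 5 + 0.10 * X_demo["cement"] + 0.03 * X_demo["age"] + rng.normal(0, 1, n)

model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its feature name for readability.
coefs = pd.Series(model.coef_, index=X_demo.columns)
print(coefs)
print("Intercept:", model.intercept_)
```

Each coefficient is the expected change in the target per unit change in that feature with the other features held fixed, so coefficients of features measured in different units are not directly comparable in size.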

**Comparing Linear and Polynomial Regression**

*Model Performance:* In my analysis, the degree-2 polynomial regression model clearly outperformed the linear regression model on the test set: R² rose from about 0.63 to about 0.78, and MAE fell from about 7.75 to about 5.97. This suggests that the relationship between the variables is not strictly linear, and the polynomial model was able to capture some of the non-linearity.

*Reasoning:* The polynomial model's improved performance can be attributed to its ability to fit more complex curves to the data. By introducing higher-order terms, the model can capture non-linear patterns that a linear model might miss. However, it's essential to avoid overfitting, as adding too many polynomial terms can lead to a model that is too complex and might not generalize well to new data.
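A common guard against overfitting is to compare polynomial degrees with cross-validation rather than a single train/test split. The sketch below is one possible setup, not the method used in this notebook: the data are synthetic with an assumed quadratic ground truth, and the degree range and `cv=5` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic one-feature data with an assumed quadratic relationship plus noise.
rng = np.random.default_rng(2)
X_demo = rng.uniform(-3, 3, size=(150, 1))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 0] ** 2 + rng.normal(0, 1, 150)

cv_mae = {}
for degree in (1, 2, 3, 4):
    # The pipeline refits the polynomial expansion inside every fold.
    pipe = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(
        pipe, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[degree] = -scores.mean()  # scores are negated MAE, so flip the sign
    print(f"degree={degree}: cross-validated MAE = {cv_mae[degree]:.3f}")
```

If the cross-validated error stops improving, or gets worse, as the degree increases, the extra terms are fitting noise rather than structure.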

**Model Evaluation and Error Analysis**

*MAE and R²:* The Mean Absolute Error (MAE) provides a measure of the average absolute difference between the predicted and actual values. A lower MAE indicates better accuracy. The R² score measures the proportion of variance in the dependent variable explained by the independent variables. A higher R² score suggests a better fit.
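To make the two definitions concrete, the sketch below computes MAE and R² by hand with NumPy and checks them against scikit-learn. The four actual and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted values.
y_true = np.array([30.0, 45.0, 25.0, 60.0])
y_hat = np.array([28.0, 50.0, 27.0, 55.0])

# MAE: mean of the absolute differences.
mae_manual = np.mean(np.abs(y_true - y_hat))  # (2 + 5 + 2 + 5) / 4 = 3.5

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)          # 58.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 750.0
r2_manual = 1 - ss_res / ss_tot                 # about 0.9227

print("MAE:", mae_manual, "| sklearn:", mean_absolute_error(y_true, y_hat))
print("R^2:", r2_manual, "| sklearn:", r2_score(y_true, y_hat))
```

MAE is expressed in the units of the target, while R² is unitless and compares the model against a baseline that always predicts the mean.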

*Error Analysis:* While the errors in my model were generally acceptable, there were some outliers where the predictions were significantly off. These outliers might be due to measurement errors, unusual data points, or other factors. Further investigation into these outliers could help to improve the model's accuracy.
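A simple way to locate the poorly predicted points is to inspect the residuals and flag the largest ones. The sketch below uses made-up values, and the cutoff of three times the median absolute error is an arbitrary assumption. In the notebook, `y_test` and `y_pred` would replace `y_true` and `y_hat`.

```python
import pandas as pd

# Made-up actual and predicted values; the fourth prediction is far off.
y_true = pd.Series([30.0, 45.0, 25.0, 60.0, 38.0, 52.0])
y_hat = pd.Series([31.0, 44.0, 26.0, 40.0, 37.0, 53.0])

errors = pd.DataFrame({
    "actual": y_true,
    "predicted": y_hat,
    "residual": y_true - y_hat,
})
abs_err = errors["residual"].abs()

# Assumed rule of thumb: flag errors above 3x the median absolute error.
threshold = 3 * abs_err.median()
flagged = errors[abs_err > threshold]
print(flagged)
```

The flagged rows can then be checked against the original records to decide whether they reflect measurement problems or genuinely unusual mixtures.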

**Impact of Data Preprocessing**

*Missing Values and Scaling:* Handling missing values and scaling features are important steps in building accurate regression models. Missing values can introduce bias and reduce the model's predictive power. Scaling puts features on a comparable range, which makes coefficients easier to compare and matters for regularized models, although it does not change the predictions of ordinary least squares. Note that the code cells above fit both models on the raw features and contain no explicit imputation or scaling step.
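If imputation and scaling are needed, a scikit-learn pipeline is one way to apply them so that the statistics are learned from the training data only. This is a sketch under stated assumptions: the data are synthetic with a few values deliberately set to NaN, and median imputation with `StandardScaler` is just one reasonable choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with an assumed linear ground truth.
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 1, 200)

# Knock out a few entries to simulate missing values.
X_demo[rng.integers(0, 200, 10), rng.integers(0, 3, 10)] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Imputer and scaler are fitted on the training split only, then reused on the test split.
pipe = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), LinearRegression()
)
pipe.fit(X_tr, y_tr)
print("Test R-squared:", pipe.score(X_te, y_te))
```

Fitting the preprocessing steps inside the pipeline avoids leaking information from the test set into the imputation values or scaling parameters.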

**Real-World Implications**

*Insights and Predictions:* Based on the model, insights can be gained into the factors influencing concrete compressive strength. For example, increasing the amount of cement or fine aggregate can generally lead to higher compressive strength. The model can also be used to predict the compressive strength of concrete mixtures with different compositions.

*Reliability and External Factors:* The reliability of the predictions depends on the quality of the data and the validity of the assumptions underlying the model. External factors such as environmental conditions, manufacturing processes, and material quality can also influence the actual compressive strength.

**Reflection on Statistical Concepts**

*Mean, Variance, and Standard Deviation:* These statistical concepts were instrumental in understanding the distribution of the data. The mean provided a measure of the central tendency, while the variance and standard deviation gave insights into the spread of the data. For example, blast furnace slag has a standard deviation (about 86.3) larger than its mean (about 73.9), which indicates a widely spread distribution. By analyzing these statistics, I could identify potential outliers and understand the variability in the variables.
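The relationship between these statistics can be checked numerically. The sketch below uses a small made-up series to show that pandas reports the sample variance (`ddof=1`) while NumPy defaults to the population variance (`ddof=0`), and then confirms that the standard deviation printed above for the cement column is the square root of its printed variance.

```python
import math

import numpy as np
import pandas as pd

# Made-up values with mean 5 and squared deviations summing to 32.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = s.var()                   # pandas default ddof=1: 32 / 7
population_var = s.to_numpy().var()    # NumPy default ddof=0: 32 / 8 = 4.0
print("Mean:", s.mean())
print("Sample variance:", sample_var)
print("Population variance:", population_var)
print("Sample std equals sqrt(sample variance):",
      math.isclose(s.std(), math.sqrt(sample_var)))

# Values copied from the notebook output for the cement column.
cement_var = 10921.742654363268
cement_std = 104.5071416428718
print("Cement std matches sqrt(variance):",
      math.isclose(math.sqrt(cement_var), cement_std, rel_tol=1e-9))
```

The difference between the two variance conventions shrinks as the number of rows grows, so it has little practical effect on a dataset of this size.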

**Link to the dataset:** https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength