# Integrating Pandas with Other Data Science Libraries (NumPy, Scikit-learn)

## Introduction to NumPy and Scikit-learn
- NumPy: A fundamental library for numerical computing in Python.
- Scikit-learn: A robust library for machine learning in Python.

### Creating a Dataset with Python Faker
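First, install the required libraries if you haven't already (for example, with `pip install pandas numpy faker scikit-learn`). Then build a synthetic dataset of 100 rows: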

In [60]:
import pandas as pd
import numpy as np
from faker import Faker

# Initialize Faker
fake = Faker()

# Create a dataset
data = {
    'name': [fake.name() for _ in range(100)],
    'age': [fake.random_int(min=18, max=80) for _ in range(100)],
    'salary': [fake.random_int(min=30000, max=120000) for _ in range(100)],
    'city': [fake.city() for _ in range(100)],
    'purchase_amount': [fake.random_number(digits=5) for _ in range(100)]
}

df = pd.DataFrame(data)
print(df.head())


                       name  age  salary                city  purchase_amount
0                John Allen   36   62325        Nicholsmouth            49855
1                John Weber   36   74276  North Kimberlyport            48734
2  Mrs. Barbara Mcintyre MD   43   81505          New Nicole            44109
3       Dr. Colleen Shannon   44   92182     North Cassandra            25202
4           Mr. Joshua Diaz   50  110606         West Nicole            22587


## Using NumPy with Pandas

NumPy arrays can be used for efficient numerical operations. Here’s how you can integrate NumPy with Pandas:

In [61]:
import numpy as np

# Convert a Pandas column to a NumPy array
# (to_numpy() is the recommended accessor; .values also works)
ages = df['age'].to_numpy()
print(ages[:5])

# Perform a NumPy operation
mean_age = np.mean(ages)
print(f"Mean age: {mean_age}")

# Add a new column with NumPy operations
df['age_squared'] = np.square(df['age'])
print(df.head())


[36 36 43 44 50]
Mean age: 48.37
                       name  age  salary                city  purchase_amount  \
0                John Allen   36   62325        Nicholsmouth            49855   
1                John Weber   36   74276  North Kimberlyport            48734   
2  Mrs. Barbara Mcintyre MD   43   81505          New Nicole            44109   
3       Dr. Colleen Shannon   44   92182     North Cassandra            25202   
4           Mr. Joshua Diaz   50  110606         West Nicole            22587   

   age_squared  
0         1296  
1         1296  
2         1849  
3         1936  
4         2500  
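NumPy functions also accept a Pandas Series directly, which is why `np.square(df['age'])` above can be assigned straight back to the DataFrame. Here is a small sketch with hand-written values; the `people` frame and the `is_senior` column are illustrative and not part of the dataset above:

```python
import numpy as np
import pandas as pd

# Illustrative data, not the Faker dataset
people = pd.DataFrame({'age': [36, 36, 43, 44, 50]})

# A NumPy ufunc applied to a Series returns a Series with the same index
squared = np.square(people['age'])
print(type(squared).__name__)

# to_numpy() gives a plain ndarray when the index is not needed
ages = people['age'].to_numpy()
print(type(ages).__name__)

# np.where builds a conditional column in one vectorized step
people['is_senior'] = np.where(ages >= 44, 'yes', 'no')
print(people)
```

Keeping the result as a Series preserves the index, while converting to an array first is convenient when passing data to code that expects plain NumPy input.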


## Data Preprocessing with Scikit-learn
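Scikit-learn provides transformers for preprocessing data. The example below standardizes the numeric columns with `StandardScaler`, one-hot encodes `city` with `OneHotEncoder`, and combines both steps in a `ColumnTransformer`: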

In [62]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define the columns
numeric_features = ['age', 'salary', 'purchase_amount']
categorical_features = ['city']

# Create transformers
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

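# Fit and transform the dataset
df_preprocessed = preprocessor.fit_transform(df)

# The one-hot columns are mostly zeros, so the result here is a SciPy sparse
# matrix; printing it lists each stored entry as (row, column) value
print(df_preprocessed[:5])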


  (0, 0)	-0.7198273493389845
  (0, 1)	-0.3844253575991263
  (0, 2)	-0.05535646154969576
  (0, 53)	1.0
  (1, 0)	-0.7198273493389845
  (1, 1)	0.061782480076450576
  (1, 2)	-0.09636531259199622
  (1, 61)	1.0
  (2, 0)	-0.31248770137027854
  (2, 1)	0.33168763098907256
  (2, 2)	-0.2655588327263759
  (2, 49)	1.0
  (3, 0)	-0.2542963230890348
  (3, 1)	0.7303288394956523
  (3, 2)	-0.9572219430357202
  (3, 57)	1.0
  (4, 0)	0.09485194659842738
  (4, 1)	1.418215476708842
  (4, 2)	-1.0528848738792667
  (4, 96)	1.0


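## Machine Learning with Scikit-learn

The example below loads the Iris dataset into a DataFrame, splits it into training and test sets, and fits a pipeline that standardizes the features before training a model. Note that the Iris target is a class label (0, 1, or 2). `LinearRegression` treats it as a continuous value, so the reported mean squared error measures how far the numeric predictions fall from the label values rather than classification accuracy.
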
In [63]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Convert to a DataFrame for better readability (optional)
# Note: this reassigns df, replacing the Faker dataset from the earlier cells
df = pd.DataFrame(data=X, columns=iris.feature_names)
df['target'] = y

# Print DataFrame head to inspect the data
print("DataFrame head:\n", df.head())

# Split the data into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a preprocessor (standard scaler in this case)
preprocessor = StandardScaler()

# Create a pipeline that includes preprocessing and model training
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Train the model
model_pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = model_pipeline.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")


DataFrame head:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
Mean Squared Error: 0.03711379440797688
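
Since the Iris target is a class label, a classifier is often a more natural fit than a regressor. The following sketch swaps `LogisticRegression` and `accuracy_score` into the same pipeline structure. This is an alternative to the example above, not a reproduction of it, and the name `clf_pipeline` is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load features as a DataFrame and the target as a Series
X, y = load_iris(return_X_y=True, as_frame=True)

# Same split settings as the regression example
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the features, then fit a classifier instead of a regressor
clf_pipeline = Pipeline(steps=[
    ('preprocessor', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
clf_pipeline.fit(X_train, y_train)

# Accuracy is the fraction of test rows whose predicted class matches the label
accuracy = accuracy_score(y_test, clf_pipeline.predict(X_test))
print(f"Accuracy: {accuracy:.3f}")
```

Because the scaler sits inside the pipeline, it is fitted on the training rows only and then applied to the test rows, which avoids leaking test-set statistics into training.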
