
# Analysis of ECN/ECR

In [1]:
print("hi")

hi


I have CSV files with the columns 'RefId', 'System name', 'part name', 'changes', 'ebom', and 'comments'.
The files record changes made in the middle of ongoing work.

Help me preprocess this file and analyze it so that likely changes can be flagged in advance.

To preprocess this CSV file and analyze the data for predicting potential changes beforehand, follow these steps:

### Step 1: Data Cleaning
1. **Handle Missing Values**:
   - Identify missing values in each column.
   - For columns like 'RefId' and 'part name', missing values may be problematic, so consider filling them based on similar entries or removing such rows.
   - For 'comments' or 'changes', fill in with default text like "No comments" if missing.

2. **Standardize Text**:
   - Convert 'System name', 'part name', and 'changes' to lowercase for consistency.
   - Remove special characters, extra spaces, and typos.

3. **Remove Duplicates**:
   - Check for and remove any duplicate entries based on 'RefId' or a combination of relevant columns.
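
As a rough illustration of the cleaning steps above, here is a minimal sketch on a few invented rows. The sample values and the `normalize_text` helper are assumptions made for this example, not part of your data, and the regex keeps only lowercase letters, digits, and spaces, which may be too aggressive for real part names.

```python
import pandas as pd

# Toy rows standing in for the real CSV; the values are invented for illustration.
df = pd.DataFrame({
    "RefId": ["R1", "R2", None, "R2"],
    "System name": ["  Brake  System", "brake-system!", "Engine", "brake-system!"],
    "part name": ["Caliper", "caliper ", "Piston", "caliper "],
    "changes": ["Update  bolt size", None, "Redesign housing", None],
})

# Count missing values per column before deciding how to fill or drop them.
missing_per_column = df.isna().sum()


def normalize_text(series: pd.Series) -> pd.Series:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    return (
        series.str.lower()
        .str.replace(r"[^a-z0-9\s]", " ", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )


for col in ["System name", "part name", "changes"]:
    df[col] = normalize_text(df[col])

# Drop rows that repeat the same combination of identifying columns.
deduped = df.drop_duplicates(subset=["RefId", "System name", "part name"])

print(missing_per_column)
print(deduped)
```

Normalizing before deduplicating matters here: "brake-system!" and "  Brake  System" only collapse to the same value once punctuation and spacing are cleaned.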

### Step 2: Feature Engineering
1. **Create New Features**:
   - **Change Type**: Categorize changes (e.g., "minor update," "major redesign") based on keywords in the 'changes' column.
   - **System/Part Frequency**: Calculate how frequently each system or part has undergone changes.
   - **Time Since Last Change**: If you have timestamps, compute the duration between the current and previous changes.

2. **Text Analysis for 'comments' and 'changes'**:
   - Perform sentiment analysis or keyword extraction to see if certain terms or sentiment patterns correlate with specific change types.
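
Two of the features above can be sketched as follows. This assumes a timestamp column, called `change_date` here, which is hypothetical since your listed columns do not include one. The keyword count is a plain word tally rather than a full text-analysis pipeline.

```python
import pandas as pd

# Invented rows; 'change_date' is a hypothetical column not present in the original CSV.
df = pd.DataFrame({
    "part name": ["caliper", "piston", "caliper", "caliper"],
    "changes": [
        "update bolt size",
        "redesign housing",
        "update bolt torque",
        "redesign bracket",
    ],
    "change_date": pd.to_datetime(
        ["2024-01-10", "2024-01-15", "2024-02-09", "2024-03-10"]
    ),
})

# Days since the previous change to the same part (NaN for a part's first change).
df = df.sort_values(["part name", "change_date"])
df["days_since_last_change"] = df.groupby("part name")["change_date"].diff().dt.days

# Simple keyword extraction: split on whitespace and count words across all descriptions.
keyword_counts = df["changes"].str.split().explode().value_counts()

print(df[["part name", "change_date", "days_since_last_change"]])
print(keyword_counts.head())
```

Short gaps between changes to the same part, or recurring words in the descriptions, are the kind of signal that could later feed a model.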

### Step 3: Analysis and Prediction
1. **Exploratory Data Analysis (EDA)**:
   - Visualize change frequency over time (e.g., number of changes per month).
   - Identify patterns linking certain systems or parts to frequent changes.
   - Use correlation analysis to find relationships between features.

2. **Predictive Modeling**:
   - **Classification Models**: Predict the likelihood of specific changes occurring based on features.
   - **Time Series Analysis**: If you have a timeline, predict future changes using models like ARIMA or LSTM.

3. **Anomaly Detection**:
   - Use anomaly detection algorithms to flag unusual changes that may indicate unexpected issues.
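
For the "changes over time" part of the EDA, a minimal sketch could look like this. It again assumes a hypothetical `change_date` column, and the forecast is a deliberately naive three-month average used as a baseline, not ARIMA or LSTM.

```python
import pandas as pd

# Invented rows; 'change_date' is a hypothetical column not present in the original CSV.
df = pd.DataFrame({
    "System name": ["brake", "brake", "engine", "brake", "engine"],
    "change_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-11", "2024-03-02", "2024-03-28"]
    ),
})

month = df["change_date"].dt.to_period("M")

# Number of changes per calendar month.
monthly_counts = df.groupby(month).size()

# Changes per system per month, one column per system.
per_system = df.groupby([month, "System name"]).size().unstack(fill_value=0)

# Naive baseline: expect next month's count to equal the mean of the last three months.
naive_forecast = monthly_counts.tail(3).mean()

print(monthly_counts)
print(per_system)
print(naive_forecast)
```

A baseline like this is mainly useful as a yardstick: a more elaborate time series model is only worth keeping if it beats it on held-out months.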

These steps will help preprocess the data and build a foundation for predictive analysis. Let me know if you need help with any of these tasks!

Here's a complete code example that follows the steps for preprocessing the data, performing exploratory data analysis (EDA), and building a basic predictive model using Python libraries like `pandas`, `scikit-learn`, `matplotlib`, and `seaborn`. This code assumes you have these libraries installed.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import IsolationForest

# Step 1: Data Loading
df = pd.read_csv('your_file.csv')

# Step 2: Data Cleaning
# 2.1 Handle Missing Values
df.fillna({'comments': 'No comments', 'changes': 'No change'}, inplace=True)
df.dropna(subset=['RefId', 'part name'], inplace=True)

# 2.2 Standardize Text
df['System name'] = df['System name'].str.lower().str.strip()
df['part name'] = df['part name'].str.lower().str.strip()
df['changes'] = df['changes'].str.lower().str.strip()

# 2.3 Remove Duplicates
df.drop_duplicates(subset=['RefId'], inplace=True)

# Step 3: Feature Engineering
# 3.1 Change Type Classification
def categorize_change(change_text):
    if "redesign" in change_text:
        return "major redesign"
    elif "update" in change_text:
        return "minor update"
    else:
        return "other"

df['change_type'] = df['changes'].apply(categorize_change)

# 3.2 System/Part Frequency
df['system_frequency'] = df.groupby('System name')['System name'].transform('count')
df['part_frequency'] = df.groupby('part name')['part name'].transform('count')

# Step 4: Exploratory Data Analysis (EDA)
# 4.1 Plotting the Distribution of Change Types
plt.figure(figsize=(10, 6))
sns.countplot(x='change_type', data=df)
plt.title('Distribution of Change Types')
plt.show()

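# 4.2 Heatmap of Correlation Matrix
# numeric_only=True restricts the matrix to numeric columns;
# the text columns would otherwise raise an error in pandas 2.x
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='viridis')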
plt.title('Correlation Matrix')
plt.show()

# Step 5: Predictive Modeling
# 5.1 Label Encoding for Classification
label_encoder = LabelEncoder()
df['change_type_encoded'] = label_encoder.fit_transform(df['change_type'])

# 5.2 Selecting Features for Prediction
features = ['system_frequency', 'part_frequency']
X = df[features]
y = df['change_type_encoded']

# 5.3 Splitting the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5.4 Training a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# 5.5 Making Predictions and Evaluation
y_pred = clf.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Step 6: Anomaly Detection
# 6.1 Using Isolation Forest for Anomaly Detection
iso_forest = IsolationForest(contamination=0.05, random_state=42)
# fit_predict returns labels, not scores: -1 marks an anomaly, 1 a normal row
df['anomaly_label'] = iso_forest.fit_predict(X)

# Visualizing Anomalies
plt.figure(figsize=(10, 6))
sns.scatterplot(x='system_frequency', y='part_frequency', hue='anomaly_label', data=df, palette='coolwarm')
plt.title('Anomaly Detection')
plt.show()
```

### Explanation:
1. **Data Loading & Cleaning**: Reads the CSV, handles missing values, standardizes text, and removes duplicates.
2. **Feature Engineering**: Categorizes the 'changes' column, calculates the frequency of systems and parts, and adds these as new features.
3. **EDA**: Uses plots for the distribution of change types and a heatmap to show correlations.
4. **Predictive Modeling**: Implements a Random Forest classifier for predicting change types.
5. **Anomaly Detection**: Applies Isolation Forest to flag rows whose combination of system and part frequency is unusual.
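
To turn the classifier into an early warning, one option is to rank parts by predicted probability instead of looking only at the predicted label. The sketch below is self-contained and uses invented training rows with the same two features as the example above. The candidate parts, the `THRESHOLD` value, and the `watchlist` name are assumptions for illustration only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented training data using the same two features as the example above.
train = pd.DataFrame({
    "system_frequency": [10, 12, 2, 3, 11, 1],
    "part_frequency": [5, 6, 1, 1, 4, 2],
    "change_type": [
        "major redesign", "major redesign", "minor update",
        "minor update", "major redesign", "minor update",
    ],
})
features = ["system_frequency", "part_frequency"]

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(train[features], train["change_type"])

# Hypothetical parts currently in work that we want to screen.
candidates = pd.DataFrame(
    {"system_frequency": [11, 2], "part_frequency": [5, 1]},
    index=["caliper", "gasket"],
)

# One probability column per class, in the order given by clf.classes_.
proba = pd.DataFrame(
    clf.predict_proba(candidates[features]),
    index=candidates.index,
    columns=clf.classes_,
)

# Flag parts whose predicted probability of a major redesign exceeds a chosen threshold.
THRESHOLD = 0.5  # illustrative value, to be tuned on real data
watchlist = proba[proba["major redesign"] > THRESHOLD].index.tolist()

print(proba)
print(watchlist)
```

With real data the threshold would need tuning against how many false alarms the team can tolerate, and the two frequency features alone are unlikely to be enough for reliable warnings.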

Make sure to replace `'your_file.csv'` with the path to your actual CSV file. Let me know if you need more customization or help with a specific part!