DOMAIN: Manufacturing

PROJECT OBJECTIVE: Build a synthetic data generation model from the existing data provided by the company.

Solution approach: To fill in the target column, we train a machine learning model on the rows where the target is known, so that it learns the pattern in the complete data, and then use it to generate synthetic values for the rows where the target is missing.



In [13]:
#import libraries that will be used for EDA
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)

In [2]:
#Import data and take the first look at the data
data = pd.read_excel('Part2 - Company.xlsx')
data.head()

,A,B,C,D,Quality
0,47,27,45,108,Quality A
1,174,133,134,166,Quality B
2,159,163,135,131,
3,61,23,3,44,Quality A
4,59,60,9,68,Quality A


In [3]:
#Check the shape of the data
data.shape

(61, 5)

In [4]:
#We can see that we have null values only in Quality field
data.isnull().any()

A          False
B          False
C          False
D          False
Quality     True
dtype: bool

In [5]:
# We have a total of 18 rows with a null value in Quality
data['Quality'].isnull().sum()

18
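
Before modelling, it can also help to see how the known labels are distributed alongside the missing ones. The sketch below is only an illustration: it rebuilds the five rows shown by `data.head()` above rather than reading the Excel file, and `head_rows` is a name introduced here for that purpose.

```python
import pandas as pd

# The five rows displayed by data.head(); row 2 has no Quality label.
head_rows = pd.DataFrame(
    {
        "A": [47, 174, 159, 61, 59],
        "B": [27, 133, 163, 23, 60],
        "C": [45, 134, 135, 3, 9],
        "D": [108, 166, 131, 44, 68],
        "Quality": ["Quality A", "Quality B", None, "Quality A", "Quality A"],
    }
)

# dropna=False keeps the missing labels as their own category in the counts.
counts = head_rows["Quality"].value_counts(dropna=False)
print(counts)
```

Running the same `value_counts(dropna=False)` call on the full `data` frame would show the class balance of the labelled rows next to the count of missing ones.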

## Let's build a model on the available data and generate synthetic data for the missing values in the Quality column

In [6]:
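# Separate the rows with a missing Quality from the rows with a known Quality
# (.copy() gives independent frames, so later assignments do not trigger SettingWithCopyWarning)
incomplete_data = data[data['Quality'].isnull()].copy()
complete_data = data[data['Quality'].notnull()].copy()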

In [7]:
#Check the index positions of missing values
incomplete_data.index

Int64Index([2, 5, 7, 9, 14, 18, 23, 27, 29, 32, 35, 40, 46, 52, 57, 58, 59,
            60],
           dtype='int64')

In [8]:
#Check the index positions of complete data
complete_data.index

Int64Index([ 0,  1,  3,  4,  6,  8, 10, 11, 12, 13, 15, 16, 17, 19, 20, 21, 22,
            24, 25, 26, 28, 30, 31, 33, 34, 36, 37, 38, 39, 41, 42, 43, 44, 45,
            47, 48, 49, 50, 51, 53, 54, 55, 56],
           dtype='int64')

In [9]:
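# Split features (x) and target (y): rows with a known Quality form the training set,
# rows with a missing Quality form the set to predict
x_train = complete_data.drop('Quality', axis = 1)
y_train = complete_data['Quality']
x_test = incomplete_data.drop('Quality', axis = 1)
y_test = incomplete_data['Quality']  # all NaN, so there are no true labels to score predictions against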

In [10]:
# Create a logistic regression model to predict the missing values of Quality
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train, y_train)
y_predict = model.predict(x_test)
print("Synthetic data generated for incomplete data is\n", y_predict)

Synthetic data generated for incomplete data is
 ['Quality B' 'Quality B' 'Quality B' 'Quality B' 'Quality B' 'Quality B'
 'Quality B' 'Quality B' 'Quality A' 'Quality B' 'Quality B' 'Quality B'
 'Quality B' 'Quality B' 'Quality B' 'Quality A' 'Quality B' 'Quality B']
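
Because the incomplete rows have no true labels, the predictions above cannot be scored directly. One common way to get a rough accuracy estimate is to cross-validate on the labelled rows only. The sketch below is a minimal illustration under stated assumptions: `toy` is randomly generated stand-in data (not the company data), and `max_iter=1000` and the 5-fold stratified split are choices made here, not settings from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for complete_data: two groups of rows with
# low and high sensor readings, labelled 'Quality A' and 'Quality B'.
rng = np.random.default_rng(0)
low = rng.integers(0, 100, size=(20, 4))
high = rng.integers(100, 200, size=(20, 4))
toy = pd.DataFrame(np.vstack([low, high]), columns=list("ABCD"))
toy["Quality"] = ["Quality A"] * 20 + ["Quality B"] * 20

X = toy.drop("Quality", axis=1)
y = toy["Quality"]

# Stratified folds keep the A/B ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

With the real data, `X` and `y` would be `x_train` and `y_train`; with only 43 labelled rows the fold scores will be noisy, so they are best read as a sanity check rather than a precise figure.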


In [14]:
# Now that synthetic labels exist for the missing values, fill them in and recombine the data
incomplete_data = incomplete_data.assign(Quality=y_predict)  # returns a new frame, avoiding SettingWithCopyWarning
Newdata = pd.concat([incomplete_data, complete_data]).sort_index()  # DataFrame.append was removed in pandas 2.0
Newdata

,A,B,C,D,Quality
0,47,27,45,108,Quality A
1,174,133,134,166,Quality B
2,159,163,135,131,Quality B
3,61,23,3,44,Quality A
4,59,60,9,68,Quality A
5,153,140,154,199,Quality B
6,34,28,78,22,Quality A
7,191,144,143,154,Quality B
8,160,181,194,178,Quality B
9,145,178,158,141,Quality B


In [12]:
# Let's confirm that no missing values remain
Newdata['Quality'].isnull().sum()

0
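
Once the gaps are filled, the generated labels are indistinguishable from the original ones. If that distinction matters downstream, one option is to record which rows were filled. The sketch below is only an illustration on the five rows shown by `data.head()`; the `is_synthetic` column and the `filled` frame are hypothetical names introduced here, and the single filled value for row 2 is copied from the notebook output above instead of being predicted again.

```python
import pandas as pd

# The five rows displayed by data.head(); row 2 has no Quality label.
data = pd.DataFrame(
    {
        "A": [47, 174, 159, 61, 59],
        "B": [27, 133, 163, 23, 60],
        "C": [45, 134, 135, 3, 9],
        "D": [108, 166, 131, 44, 68],
        "Quality": ["Quality A", "Quality B", None, "Quality A", "Quality A"],
    }
)

# Record which rows lack a label before anything is filled in.
was_missing = data["Quality"].isnull()

# Stand-in for the model output: the notebook predicts 'Quality B' for row 2.
filled = data.copy()
filled.loc[was_missing, "Quality"] = ["Quality B"]

# Hypothetical flag column: True where the label came from the model.
filled["is_synthetic"] = was_missing
print(filled)
```

In the notebook itself, the same mask could be taken from `data['Quality'].isnull()` before the fill and attached to `Newdata` after sorting by index.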