# Task 1: Data Analysis and Insights Generation

Instructions:
1. Obtain a real-world dataset related to a specific domain (e.g., sales, marketing, customer behavior).
2. Perform exploratory data analysis using appropriate tools (Python, R, data visualization platforms).
3. Clean and preprocess the data, handling missing values, outliers, and inconsistent formats.
4. Conduct statistical analysis, applying measures like mean, median, standard deviation, and correlation coefficients.
5. Apply advanced analytical methods (regression analysis, clustering) to identify patterns and trends.
6. Use data visualization techniques to present findings effectively (charts, graphs).
7. Interpret results, providing actionable insights and recommendations aligned with business objectives.
8. Prepare a comprehensive report summarizing the analysis approach, key findings, and recommendations.


In [87]:
import pandas as pd

In [88]:
# The file has no header row, and missing values are marked with "?".
data = pd.read_csv("Datasets/credit_approval/crx.data", header=None, na_values="?")

In [89]:
# data.head()
data.info()
# data[0].value_counts()
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       678 non-null    object 
 1   1       678 non-null    float64
 2   2       690 non-null    float64
 3   3       684 non-null    object 
 4   4       684 non-null    object 
 5   5       681 non-null    object 
 6   6       681 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      677 non-null    float64
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(4), int64(2), object(10)
memory usage: 86.4+ KB


|       | 1 | 2 | 7 | 10 | 13 | 14 |
|-------|---|---|---|----|----|----|
| count | 678.0 | 690.0 | 690.0 | 690.0 | 677.0 | 690.0 |
| mean  | 31.568171 | 4.758725 | 2.223406 | 2.4 | 184.014771 | 1017.385507 |
| std   | 11.957862 | 4.978163 | 3.346513 | 4.86294 | 173.806768 | 5210.102598 |
| min   | 13.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25%   | 22.6025 | 1.0 | 0.165 | 0.0 | 75.0 | 0.0 |
| 50%   | 28.46 | 2.75 | 1.0 | 0.0 | 160.0 | 5.0 |
| 75%   | 38.23 | 7.2075 | 2.625 | 3.0 | 276.0 | 395.5 |
| max   | 80.25 | 28.0 | 28.5 | 67.0 | 2000.0 | 100000.0 |


### Dataset details: 

Male&emsp;           A1:	b, a.<br>
Age&emsp;            A2:	continuous.<br>
Debt&emsp;           A3:	continuous.<br>
Married&emsp;        A4:	u, y, l, t.<br>
BankCustomer&emsp;   A5:	g, p, gg.<br>
EducationLevel&emsp; A6:	c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.<br>
Ethnicity&emsp;      A7:	v, h, bb, j, n, z, dd, ff, o.<br>
YearsEmployed&emsp;  A8:	continuous.<br>
PriorDefault&emsp;   A9:	t, f.<br>
Employed&emsp;       A10:	t, f.<br>
CreditScore&emsp;    A11:	continuous.<br>
DriversLicense&emsp; A12:	t, f.<br>
Citizen&emsp;        A13:	g, p, s.<br>
ZipCode&emsp;        A14:	continuous.<br>
Income&emsp;         A15:	continuous.<br>
Approved&emsp;       A16: +,-         (class attribute)<br>
(From the crx.names file. Source: https://archive.ics.uci.edu/dataset/27/credit+approval)<br>
(Column descriptions. Reference: http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html)
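
The attribute list above can be attached to the DataFrame so that later cells refer to `Age` or `Income` instead of integer positions. The following is only a sketch: the helper `name_columns` and the one-row `sample` frame are hypothetical and stand in for the real `data` object, which the rest of this notebook keeps indexing by integer.

```python
import pandas as pd

# Names follow the attribute list above (A1..A16, in file order).
COLUMN_NAMES = [
    "Male", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "Approved",
]


def name_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with its 16 positional columns renamed."""
    if frame.shape[1] != len(COLUMN_NAMES):
        raise ValueError(f"expected {len(COLUMN_NAMES)} columns, got {frame.shape[1]}")
    renamed = frame.copy()
    renamed.columns = COLUMN_NAMES
    return renamed


# Illustrative one-row stand-in for the frame returned by pd.read_csv.
sample = pd.DataFrame(
    [["b", 30.83, 0.0, "u", "g", "w", "v", 1.25, "t", "t", 1, "f", "g", 202.0, 0, "+"]]
)
named = name_columns(sample)
print(named[["Age", "Debt", "Approved"]])
```

Because the helper returns a copy, the integer-indexed `data` frame used in the cells below is left unchanged.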

In [90]:
# A14 (column 13) is a zip code, so it is treated as categorical rather than numeric.
numeric = [1, 2, 7, 10, 14]
data_num = data[numeric]
data_num.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1       678 non-null    float64
 1   2       690 non-null    float64
 2   7       690 non-null    float64
 3   10      690 non-null    int64  
 4   14      690 non-null    int64  
dtypes: float64(3), int64(2)
memory usage: 27.1 KB


In [91]:
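categorical = [0, 3, 4, 5, 6, 8, 9, 11, 12, 13]
# Copy the selection so that later column assignments do not trigger SettingWithCopyWarning.
data_cat = data[categorical].copy()
data_cat.info()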

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       678 non-null    object 
 1   3       684 non-null    object 
 2   4       684 non-null    object 
 3   5       681 non-null    object 
 4   6       681 non-null    object 
 5   8       690 non-null    object 
 6   9       690 non-null    object 
 7   11      690 non-null    object 
 8   12      690 non-null    object 
 9   13      677 non-null    float64
dtypes: float64(1), object(9)
memory usage: 54.0+ KB


### Data pre-processing: cleaning, missing values, and outliers
- Fill missing values in the Age column of data_num with the column mean
- Fill missing values in each categorical column of data_cat with that column's mode
- Convert categorical values to integer codes with LabelEncoder
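
The heading names outliers, while the steps listed deal with missing values and encoding. One common way to at least flag outliers is the interquartile-range (IQR) rule. The sketch below is an assumption about how that could look, not a step this notebook performs; the helper `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy series are all hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)


# Toy values only; they are not taken from the credit approval data.
toy_income = pd.Series([0, 0, 5, 100, 395, 400, 100000])
mask = iqr_outlier_mask(toy_income)
print(toy_income[mask])
```

Flagging rather than dropping keeps the decision about what to do with extreme values (cap, transform, or remove) separate from detecting them.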

In [111]:
# Fill missing numeric values with the column mean; only column 1 (Age) has any.
data_num = data_num.fillna(data_num.mean())
data_num.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1       690 non-null    float64
 1   2       690 non-null    float64
 2   7       690 non-null    float64
 3   10      690 non-null    int64  
 4   14      690 non-null    int64  
dtypes: float64(3), int64(2)
memory usage: 27.1 KB


In [112]:
# Fill missing values in each column with its most frequent value (the mode).
for column in data_cat.columns:
    data_cat[column] = data_cat[column].fillna(data_cat[column].value_counts().index[0])
data_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   3       690 non-null    object 
 2   4       690 non-null    object 
 3   5       690 non-null    object 
 4   6       690 non-null    object 
 5   8       690 non-null    object 
 6   9       690 non-null    object 
 7   11      690 non-null    object 
 8   12      690 non-null    object 
 9   13      690 non-null    float64
dtypes: float64(1), object(9)
memory usage: 54.0+ KB


In [117]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Replace each column's values with integer codes; the encoder is refit for every column.
for column in data_cat.columns:
    data_cat[column] = label_encoder.fit_transform(data_cat[column])

data_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       690 non-null    int32
 1   3       690 non-null    int32
 2   4       690 non-null    int32
 3   5       690 non-null    int32
 4   6       690 non-null    int32
 5   8       690 non-null    int32
 6   9       690 non-null    int32
 7   11      690 non-null    int32
 8   12      690 non-null    int32
 9   13      690 non-null    int64
dtypes: int32(9), int64(1)
memory usage: 29.8 KB
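
Refitting a single `LabelEncoder` in a loop means only the mapping for the last column survives. If the original category labels are needed later, for example to label chart axes, one option is to keep a separate encoder per column. This is a minimal sketch under that assumption; `encode_columns` and the toy frame are hypothetical and are not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_columns(frame: pd.DataFrame):
    """Label-encode every column of `frame`.

    Returns the encoded copy together with a dict that maps each column
    label to the fitted encoder used for it.
    """
    encoded = frame.copy()
    encoders = {}
    for column in encoded.columns:
        encoder = LabelEncoder()
        encoded[column] = encoder.fit_transform(encoded[column])
        encoders[column] = encoder
    return encoded, encoders


# Toy frame that reuses two of the notebook's integer column labels.
toy = pd.DataFrame({0: ["b", "a", "a", "b"], 8: ["t", "t", "f", "t"]})
toy_encoded, toy_encoders = encode_columns(toy)

# Recover the original labels for column 8 from its own encoder.
restored = toy_encoders[8].inverse_transform(toy_encoded[8])
print(list(restored))
```

`LabelEncoder` assigns codes in sorted order of the labels, so the integers carry no meaningful ranking for nominal attributes such as Ethnicity.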


In [123]:
# Join the numeric and encoded categorical columns side by side.
data_cleaned = pd.concat([data_num, data_cat], axis=1)
data_cleaned.info()
data_cleaned.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1       690 non-null    float64
 1   2       690 non-null    float64
 2   7       690 non-null    float64
 3   10      690 non-null    int64  
 4   14      690 non-null    int64  
 5   0       690 non-null    int32  
 6   3       690 non-null    int32  
 7   4       690 non-null    int32  
 8   5       690 non-null    int32  
 9   6       690 non-null    int32  
 10  8       690 non-null    int32  
 11  9       690 non-null    int32  
 12  11      690 non-null    int32  
 13  12      690 non-null    int32  
 14  13      690 non-null    int64  
dtypes: float64(3), int32(9), int64(3)
memory usage: 56.7 KB


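|   | 1 | 2 | 7 | 10 | 14 | 0 | 3 | 4 | 5 | 6 | 8 | 9 | 11 | 12 | 13 |
|---|---|---|---|----|----|---|---|---|---|---|---|---|----|----|----|
| 0 | 30.83 | 0.0 | 1.25 | 1 | 0 | 1 | 1 | 0 | 12 | 7 | 1 | 1 | 0 | 0 | 68 |
| 1 | 58.67 | 4.46 | 3.04 | 6 | 560 | 0 | 1 | 0 | 10 | 3 | 1 | 1 | 0 | 0 | 11 |
| 2 | 24.5 | 0.5 | 1.5 | 0 | 824 | 0 | 1 | 0 | 10 | 3 | 1 | 0 | 0 | 0 | 96 |
| 3 | 27.83 | 1.54 | 3.75 | 5 | 3 | 1 | 1 | 0 | 12 | 7 | 1 | 1 | 1 | 0 | 31 |
| 4 | 20.17 | 5.625 | 1.71 | 0 | 0 | 1 | 1 | 0 | 12 | 7 | 1 | 0 | 0 | 2 | 37 |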
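
The cleaned frame holds the 15 attribute columns; column 15 (Approved, the class attribute) appears in neither the numeric nor the categorical list. If a later step needs the class label next to the features, it could be attached along these lines. This is a sketch only: `attach_target`, the `"Approved"` column name, and the `+` to 1 / `-` to 0 mapping are assumptions, and the toy frames stand in for `data_cleaned` and `data`.

```python
import pandas as pd


def attach_target(features: pd.DataFrame, raw: pd.DataFrame, target_column=15) -> pd.DataFrame:
    """Return a copy of `features` with the class attribute from `raw`
    added as a 0/1 column named "Approved" (assumed mapping: + is 1, - is 0)."""
    target = raw[target_column].map({"+": 1, "-": 0})
    combined = features.copy()
    combined["Approved"] = target
    return combined


# Toy stand-ins; both frames share the same default row index.
toy_features = pd.DataFrame({1: [30.83, 58.67, 24.5], 2: [0.0, 4.46, 0.5]})
toy_raw = pd.DataFrame({15: ["+", "+", "-"]})

with_target = attach_target(toy_features, toy_raw)
print(with_target)
```

The assignment aligns on the row index, so it relies on the feature frame and the raw frame still sharing the original `RangeIndex`.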
