# DATA INTEGRATION

### Data integration is a data preprocessing technique that combines data from multiple heterogeneous sources into a single dataset and provides a unified view of that data. The sources may include multiple databases or flat files.

# Issues in Data Integration: 
##### There are three issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. Each is explained briefly below.

## 1. Schema Integration: 
* Integrate metadata from different sources.
* Matching equivalent real-world entities across multiple sources is referred to as the entity identification problem.
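
The two points above can be sketched with pandas. This is a minimal illustration, not a general method: `source_a`, `source_b`, and the column name `register_number` are hypothetical, and the sketch assumes that the registration number identifies a student uniquely in both sources.

```python
import pandas as pd

# Hypothetical sources: the same students described with different schemas.
source_a = pd.DataFrame({
    "Reg_no": ["393CS20001", "393CS20002"],
    "Name": ["ABHINAV PRASANNA NAIKODI", "AKASH TORAVI"],
})
source_b = pd.DataFrame({
    "register_number": ["393CS20002", "393CS20001"],  # same key, other name and order
    "DSP": [52, 47],
})

# Schema integration: map source_b's attribute name onto source_a's.
source_b = source_b.rename(columns={"register_number": "Reg_no"})

# Entity identification: match rows that describe the same student by key,
# not by row position.
students = source_a.merge(source_b, on="Reg_no", how="inner")
print(students)
```

Merging on a key rather than on row position keeps each mark attached to the right student even when the sources list students in a different order.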

## 2. Redundancy: 
* An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
* Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
* Some redundancies can be detected by correlation analysis.
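
As a rough sketch of correlation analysis, the snippet below computes pairwise Pearson correlations with `DataFrame.corr()` and flags strongly correlated pairs. The marks are the first five rows of the example data used later in this notebook; the 0.95 threshold is an arbitrary assumption. Correlation only catches linear relationships between numeric attributes.

```python
import pandas as pd

# First five rows of the marks used later in this notebook.
marks = pd.DataFrame({"PP": [97, 97, 75, 11, 87], "DSP": [47, 52, 21, 23, 92]})
marks["Total"] = marks["PP"] + marks["DSP"]         # derived attribute
marks["Percentage"] = marks["Total"] / 200 * 100    # derived from Total

# Pairwise Pearson correlation between all numeric columns.
corr = marks.corr()

# Flag attribute pairs whose absolute correlation exceeds the threshold.
threshold = 0.95  # assumed cut-off, tune for your data
cols = list(corr.columns)
redundant_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(redundant_pairs)
```

`Percentage` is a linear function of `Total`, so that pair is perfectly correlated and one of the two attributes can be dropped without losing information.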

## 3. Detection and resolution of data value conflicts: 
* This is the third critical issue in data integration.
* Attribute values from different sources may differ for the same real-world entity, for example because of differences in representation, scaling, or encoding.
* An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
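
A small sketch of detecting and resolving a scaling conflict follows. The two systems, their values, and the assumption that one stores the PP mark out of 100 while the other stores it out of 50 are all hypothetical.

```python
import pandas as pd

# Hypothetical: system_a stores PP out of 100, system_b stores it out of 50.
system_a = pd.DataFrame({"Reg_no": ["393CS20001", "393CS20002"], "PP": [97, 97]})
system_b = pd.DataFrame({"Reg_no": ["393CS20001", "393CS20002"], "PP": [48.5, 40.0]})

# Resolve the scaling conflict: bring system_b onto the 0-100 scale.
system_b["PP"] = system_b["PP"] * 2

# Detect remaining value conflicts for the same student.
merged = system_a.merge(system_b, on="Reg_no", suffixes=("_a", "_b"))
merged["conflict"] = merged["PP_a"] != merged["PP_b"]
print(merged)
```

After rescaling, the first student agrees across both systems, while the second still differs and needs a resolution rule, such as trusting one source over the other.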

# Approaches

## Joining two Data Frames:

In [8]:
import pandas as pd
df1 = pd.read_csv('csv files/student.csv')
df1

Unnamed: 0,Reg_no,Name
0,393CS20001,ABHINAV PRASANNA NAIKODI
1,393CS20002,AKASH TORAVI
2,393CS21702,BIRADAR GURURAJ SIDHARUD
3,393CS20005,DUGIWADE PRACHI SANJAY
4,393CS21703,GURUNATH DESHAPANDE
5,393CS20006,HARSHAD MAHAMMADSAB WALIKAR
6,393CS20007,IRAMNAAZ JAMKHANDI
7,393CS20008,JAHID ALI BALABATTI
8,393CS20009,KRISHNA MAHADEV JADHAV
9,393CS20010,LAXMI NAVALAGI


In [9]:
import pandas as pd
df2 = pd.read_csv('csv files/marks.csv')
df2

Unnamed: 0,PP,DSP
0,97,47
1,97,52
2,75,21
3,11,23
4,87,92
5,45,78
6,97,17
7,23,45
8,87,65
9,61,39


In [10]:
df = pd.concat([df1, df2], axis=1, join='inner')
df

Unnamed: 0,Reg_no,Name,PP,DSP
0,393CS20001,ABHINAV PRASANNA NAIKODI,97,47
1,393CS20002,AKASH TORAVI,97,52
2,393CS21702,BIRADAR GURURAJ SIDHARUD,75,21
3,393CS20005,DUGIWADE PRACHI SANJAY,11,23
4,393CS21703,GURUNATH DESHAPANDE,87,92
5,393CS20006,HARSHAD MAHAMMADSAB WALIKAR,45,78
6,393CS20007,IRAMNAAZ JAMKHANDI,97,17
7,393CS20008,JAHID ALI BALABATTI,23,45
8,393CS20009,KRISHNA MAHADEV JADHAV,87,65
9,393CS20010,LAXMI NAVALAGI,61,39
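
Note that `pd.concat(..., axis=1)` aligns rows by index label, not by a key column. The sketch below uses two hypothetical frames, `left` and `right`, with deliberately mismatched indexes to show how `join='inner'` and `join='outer'` differ.

```python
import pandas as pd

# Hypothetical frames whose index labels only partly overlap.
left = pd.DataFrame(
    {"Reg_no": ["393CS20001", "393CS20002", "393CS21702"]}, index=[0, 1, 2]
)
right = pd.DataFrame({"PP": [97, 75]}, index=[1, 2])

# join='inner' keeps only index labels present in both frames.
inner = pd.concat([left, right], axis=1, join="inner")

# join='outer' keeps every label and fills the gaps with NaN.
outer = pd.concat([left, right], axis=1, join="outer")

print(inner)
print(outer)
```

In the example above the two CSV files share the same default index, so both join modes would give the same result. If the files had a common key column, merging on that key would be more robust than relying on row order.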


## Adding attributes

In [11]:
df['Total'] = df['PP'] + df['DSP']
df

Unnamed: 0,Reg_no,Name,PP,DSP,Total
0,393CS20001,ABHINAV PRASANNA NAIKODI,97,47,144
1,393CS20002,AKASH TORAVI,97,52,149
2,393CS21702,BIRADAR GURURAJ SIDHARUD,75,21,96
3,393CS20005,DUGIWADE PRACHI SANJAY,11,23,34
4,393CS21703,GURUNATH DESHAPANDE,87,92,179
5,393CS20006,HARSHAD MAHAMMADSAB WALIKAR,45,78,123
6,393CS20007,IRAMNAAZ JAMKHANDI,97,17,114
7,393CS20008,JAHID ALI BALABATTI,23,45,68
8,393CS20009,KRISHNA MAHADEV JADHAV,87,65,152
9,393CS20010,LAXMI NAVALAGI,61,39,100


In [12]:
df['Percentage'] = df['Total']/200*100
df

Unnamed: 0,Reg_no,Name,PP,DSP,Total,Percentage
0,393CS20001,ABHINAV PRASANNA NAIKODI,97,47,144,72.0
1,393CS20002,AKASH TORAVI,97,52,149,74.5
2,393CS21702,BIRADAR GURURAJ SIDHARUD,75,21,96,48.0
3,393CS20005,DUGIWADE PRACHI SANJAY,11,23,34,17.0
4,393CS21703,GURUNATH DESHAPANDE,87,92,179,89.5
5,393CS20006,HARSHAD MAHAMMADSAB WALIKAR,45,78,123,61.5
6,393CS20007,IRAMNAAZ JAMKHANDI,97,17,114,57.0
7,393CS20008,JAHID ALI BALABATTI,23,45,68,34.0
8,393CS20009,KRISHNA MAHADEV JADHAV,87,65,152,76.0
9,393CS20010,LAXMI NAVALAGI,61,39,100,50.0


## Adding data objects

In [13]:
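# DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0;
# on current versions, use pd.concat([df1, df2]) as in the next cell.
df1.append(df2)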

Unnamed: 0,Reg_no,Name,PP,DSP
0,393CS20001,ABHINAV PRASANNA NAIKODI,,
1,393CS20002,AKASH TORAVI,,
2,393CS21702,BIRADAR GURURAJ SIDHARUD,,
3,393CS20005,DUGIWADE PRACHI SANJAY,,
4,393CS21703,GURUNATH DESHAPANDE,,
...,...,...,...,...
35,,,77.0,70.0
36,,,43.0,27.0
37,,,10.0,14.0
38,,,90.0,22.0


In [14]:
pd.concat([df1,df2])

Unnamed: 0,Reg_no,Name,PP,DSP
0,393CS20001,ABHINAV PRASANNA NAIKODI,,
1,393CS20002,AKASH TORAVI,,
2,393CS21702,BIRADAR GURURAJ SIDHARUD,,
3,393CS20005,DUGIWADE PRACHI SANJAY,,
4,393CS21703,GURUNATH DESHAPANDE,,
...,...,...,...,...
35,,,77.0,70.0
36,,,43.0,27.0
37,,,10.0,14.0
38,,,90.0,22.0
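
The NaN values above appear because `df1` and `df2` have no columns in common, so stacking them row-wise leaves every row half empty. Row-wise concatenation is more useful when both frames share a schema. The sketch below assumes two hypothetical batches of student records, `batch_1` and `batch_2`, with identical columns.

```python
import pandas as pd

# Hypothetical batches of new data objects with the same columns.
batch_1 = pd.DataFrame({
    "Reg_no": ["393CS20001", "393CS20002"],
    "Name": ["ABHINAV PRASANNA NAIKODI", "AKASH TORAVI"],
})
batch_2 = pd.DataFrame({
    "Reg_no": ["393CS20005"],
    "Name": ["DUGIWADE PRACHI SANJAY"],
})

# Stack the rows; ignore_index=True renumbers the result 0..n-1
# instead of repeating each frame's original index labels.
students = pd.concat([batch_1, batch_2], ignore_index=True)
print(students)
```

Without `ignore_index=True`, the result would keep the index labels 0, 1, 0, which makes label-based lookups ambiguous.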
