## Merging data from multiple data sources


In this article, you will learn how to combine data from various sources. Data comes in many forms and types, so a data-driven decision maker needs to know how to merge datasets from disparate sources. Python is a popular tool for this task: it can read data in many formats, and libraries such as pandas provide objects like the dataframe to store and manipulate data. We will fetch and combine data from three different sources:

1. A CSV file - Customer_Order table
2. An SQL Database - sqlite
3. An HTML web page

All three sources yield numeric and text data, so we can store each of them in a pandas dataframe.
<br>
The image below shows how these various data sources are linked.<br>
<img src="../../../images/merge.PNG" style="height:70vh">
<br>


We will use Python to import data from the three sources, convert each dataset to a pandas dataframe, and combine the dataframes into a master dataframe.

#### Pulling data from a CSV file

The first task is to pull data from a CSV file.


```python 

import pandas as pd

cust_order_link = '../../../data/Customer_Order.csv'
# Reading data into tables
df_csv_custorder = pd.read_csv(cust_order_link)
df_csv_custorder
```

The above code imports the CSV file into a pandas dataframe and displays it.
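
To try `pd.read_csv` without the file at hand, the same rows can be read from an in-memory string. This is only a sketch: the header names `Customer No` and `Order No` are assumptions (the actual header of `Customer_Order.csv` is not shown here), and the rows are copied from the customer table printed later in this article.

```python
import io
import pandas as pd

# Hypothetical CSV content: the header names are assumed, the rows mirror
# the customer/order table printed later in the article
csv_text = """Customer No,Order No
100,22345
101,46238
102,66266
100,67222
102,21573
101,11467
"""

# header=0 discards the file's own header row; names= supplies our column names
df_sketch_custorder = pd.read_csv(
    io.StringIO(csv_text), header=0, names=["customerno", "orderno"]
)

print(df_sketch_custorder.dtypes)  # both columns are parsed as integers
print(df_sketch_custorder.head())
```

Supplying the column names at read time is one way to avoid renaming the columns afterwards.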

#### Pulling data from a SQL database 

The second block of code reads a table from an SQLite database file into pandas.

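```python

# Importing library
import sqlite3

# Placeholder path: the article does not give the location of the database
# file, so replace this with the path to your own SQLite file
productdb_link = 'path/to/product.db'

# Connecting to the database
oltp_con = sqlite3.connect(productdb_link)

# Creating a cursor on the database connection
oltp_cur = oltp_con.cursor()

# Executing query and fetching every row as a list of tuples
oltp_cur.execute('''SELECT * FROM product''')
products = oltp_cur.fetchall()

# Closing the connection once the rows have been fetched
oltp_con.close()

# List of tuples - conversion to DataFrame
df_db_products = pd.DataFrame(products, columns=['productno', 'productname'])
df_db_products

```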

The SQLite database is read with Python's built-in sqlite3 package. The query returns every row of the product table as a list of tuples, which is then converted to a dataframe with the columns productno and productname.
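
As an alternative to the cursor, pandas can run the query itself with `pd.read_sql_query`, which takes the column names from the query result. The sketch below is not the article's method: it builds an in-memory SQLite database as a stand-in for the product database, with an assumed schema (`INTEGER`, `TEXT`) and the rows from the product table printed later in this article.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the product database (the schema is an assumption)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product (productno INTEGER, productname TEXT)")
con.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [
        (1568, "WB A1 Paper"),
        (4321, "3M Scotch Tape"),
        (2371, "Pilot pens - Set of 10"),
        (2931, "Pilot LE pens - Set of 3"),
        (7317, "Regis Stapler"),
        (5873, "Pidilite Glustick"),
    ],
)
con.commit()

# read_sql_query runs the query and builds the dataframe in one step
df_sketch_products = pd.read_sql_query(
    "SELECT productno, productname FROM product", con
)
con.close()

print(df_sketch_products)
```

Naming the columns in the query, rather than using `SELECT *`, keeps the dataframe stable if columns are later added to the table.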

#### Pulling data from an HTML page

The third step is to read data from HTML content on the web. The HTML page is: 

https://raw.githubusercontent.com/colaberry/DSin100days/master/data/order-transactions.html

The above link contains the HTML code for a page. Study it to understand the different components of the page (such as the div tags and their class attributes), then use them to extract the transaction details into a dictionary with the following keys:

{"orderno": orderno , "productno" : productno , "quantity" : quantity , "total" : total}


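```python

import requests
from bs4 import BeautifulSoup

# URL of the HTML page given above
url = 'https://raw.githubusercontent.com/colaberry/DSin100days/master/data/order-transactions.html'

response = requests.get(url)
html = response.content

# Naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup(html, "html.parser")
# Each transaction sits in its own <div class="transaction">
tags = soup.find_all("div", class_="transaction")
transactions = []

for t in tags:
    # Each field is stored in a <small> tag whose class names the field
    orderno = t.find("small", class_="orderno").text
    productno = t.find("small", class_="productno").text
    quantity = t.find("small", class_="quantity").text
    total = t.find("small", class_="total").text
    order_detail = {"orderno": orderno, "productno": productno, "quantity": quantity, "total": total}
    transactions.append(order_detail)

df_web_transactions = pd.DataFrame(transactions)
# The scraped values are strings; convert them to numbers so they match the
# keys of the other tables (assumes every field is purely numeric and that
# the CSV and database keys are numeric too)
df_web_transactions = df_web_transactions.apply(pd.to_numeric)
df_web_transactions

```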

First, every div with the class 'transaction' is located. Then, for each of these divs, the order details are extracted and appended to a list called 'transactions'. This list is then converted to a pandas dataframe.
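
Text pulled out of HTML tags is always a string, while keys read from a CSV file or a database are typically integers, and recent pandas versions raise an error when asked to merge a string column with an integer column. A minimal sketch of the conversion step, assuming every scraped field is purely numeric (the two rows are copied from the transaction table printed later in this article):

```python
import pandas as pd

# Scraped values arrive as strings, e.g. from tag.text
scraped = [
    {"orderno": "22345", "productno": "1568", "quantity": "1", "total": "200.0"},
    {"orderno": "46238", "productno": "4321", "quantity": "1", "total": "500.0"},
]
df_raw = pd.DataFrame(scraped)
print(df_raw.dtypes)      # every column holds strings

# pd.to_numeric converts one column at a time; apply() runs it on each column
df_numeric = df_raw.apply(pd.to_numeric)
print(df_numeric.dtypes)  # integer keys and a float total
```

If a field could contain non-numeric text such as a currency symbol, it would need cleaning before this step.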


#### Combining the dataframes

The three dataframes can be combined into a master dataset that contains information from all three sources:

```python
# Printing all 3 tables out
print(df_web_transactions)
print(df_db_products)
# Renaming the columns so that the key matches 'orderno' in the transactions table
df_csv_custorder.columns = ['customerno', 'orderno']
print(df_csv_custorder)

>>> # Output
>>>    orderno  productno  quantity   total
>>> 0    22345       1568         1   200.0
>>> 1    46238       4321         1   500.0
>>> 2    66266       7317         1   700.0
>>> 3    67222       7317         1   700.0
>>> 4    67222       2371         1   800.0
>>> 5    21573       2931         1  1200.0
>>> 6    11467       5873         1   200.0
>>>    productno               productname
>>> 0       1568               WB A1 Paper
>>> 1       4321            3M Scotch Tape
>>> 2       2371    Pilot pens - Set of 10
>>> 3       2931  Pilot LE pens - Set of 3
>>> 4       7317             Regis Stapler
>>> 5       5873         Pidilite Glustick
>>>    customerno  orderno
>>> 0         100    22345
>>> 1         101    46238
>>> 2         102    66266
>>> 3         100    67222
>>> 4         102    21573
>>> 5         101    11467
```

Print the list of product names for orders placed by customer no. 100.
```python
# Combining transactions and product names from product table
temp = pd.merge(df_web_transactions, df_db_products[['productno','productname']], on='productno')
# Merging transactions and custorder table to link Customers to orders
result = pd.merge(temp, df_csv_custorder[['customerno','orderno']], on='orderno')
# Condition to filter products purchased by customer no.100
prods = result[result['customerno']==100]['productname']
print(prods)

>>> # Output
>>> 0               WB A1 Paper
>>> 3             Regis Stapler
>>> 4    Pilot pens - Set of 10
>>> Name: productname, dtype: object
```

Print the list of customer numbers who have ordered the product 'Regis Stapler'.
```python
# Condition to extract customer nos who purchased 'Regis Stapler'
custs = result[result['productname']=='Regis Stapler']['customerno']
print(custs)

>>> # Output
>>> 2    102
>>> 3    100
>>> Name: customerno, dtype: int64
```
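
To experiment with the merges without access to the three sources, the tables can be re-created by hand from the output printed above. The sketch below chains both merges and adds `validate="many_to_one"`, an optional `merge` argument not used in the code above, which makes pandas check that each key appears only once in the right-hand table. The names `transactions`, `products`, `custorder`, and `master` are illustrative.

```python
import pandas as pd

# The three tables, re-created from the output printed above
transactions = pd.DataFrame({
    "orderno":   [22345, 46238, 66266, 67222, 67222, 21573, 11467],
    "productno": [1568, 4321, 7317, 7317, 2371, 2931, 5873],
    "quantity":  [1, 1, 1, 1, 1, 1, 1],
    "total":     [200.0, 500.0, 700.0, 700.0, 800.0, 1200.0, 200.0],
})
products = pd.DataFrame({
    "productno":   [1568, 4321, 2371, 2931, 7317, 5873],
    "productname": ["WB A1 Paper", "3M Scotch Tape", "Pilot pens - Set of 10",
                    "Pilot LE pens - Set of 3", "Regis Stapler", "Pidilite Glustick"],
})
custorder = pd.DataFrame({
    "customerno": [100, 101, 102, 100, 102, 101],
    "orderno":    [22345, 46238, 66266, 67222, 21573, 11467],
})

# validate="many_to_one" raises an error if a key is duplicated in the
# right-hand table, which would otherwise silently multiply rows
master = (
    transactions
    .merge(products, on="productno", validate="many_to_one")
    .merge(custorder, on="orderno", validate="many_to_one")
)

# The same two questions as above, answered from the master dataframe
prods_100 = master.loc[master["customerno"] == 100, "productname"]
custs_stapler = master.loc[master["productname"] == "Regis Stapler", "customerno"]
print(prods_100.tolist())
print(custs_stapler.tolist())
```

Both merges are inner joins by default, so a transaction whose product or order number had no match in the other tables would be dropped from `master`.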



## Creating new data from old

As you saw above, by combining data from multiple datasets, we created a new dataset. This new dataset can be visualized to provide insights or can be used to train a machine learning model.

Note that this was straightforward because we were dealing with text and numeric data. Raw image and audio data do not fit naturally into the rows and columns of a dataframe; one would first have to extract features and then store those features in a dataframe.
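
As a rough illustration of feature extraction, the sketch below summarizes each image as a few numbers and stores one row per image. Everything in it is an assumption made for illustration: the "images" are small random arrays rather than real files, and the three summary statistics are arbitrary choices, not a recommended feature set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for grayscale images (8x8 pixel arrays); real images
# would be loaded from files with an imaging library
rng = np.random.default_rng(seed=0)
images = {f"img_{i}": rng.integers(0, 256, size=(8, 8)) for i in range(3)}

# One row of simple summary features per image
rows = [
    {
        "image_id": name,
        "mean_intensity": float(pixels.mean()),
        "std_intensity": float(pixels.std()),
        "max_intensity": int(pixels.max()),
    }
    for name, pixels in images.items()
]
df_features = pd.DataFrame(rows)
print(df_features)
```

Once the features are in a dataframe, they can be merged with other tables on a shared key such as `image_id`, in the same way as the tables above.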

If you want to learn more about combining datasets using pandas, check out our course at https://refactored.ai. Our course covers everything from introductory Python to Pandas, data visualization, and machine learning techniques.
