# Read the most commonly used file formats in Data Science using Python

Information is scraped from this [link](https://www.analyticsvidhya.com/blog/2017/03/read-commonly-used-formats-using-python/)

## .CSV file

In [None]:
import pandas as pd
df = pd.read_csv("directory path")

## XLSX file

In [None]:
import pandas as pd
# sheet_name replaced the older sheetname keyword, which newer pandas versions no longer accept
df = pd.read_excel("directory path", sheet_name="tab sheet")

## ZIP file

The built-in "zipfile" package can read a file inside an archive without extracting it first. Below is the Python code which reads the "train.csv" file that is inside "T.zip".

In [None]:
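import zipfile
import pandas as pd
# archive.read() would return raw bytes, so parse the opened member as CSV instead
with zipfile.ZipFile('T.zip', 'r') as archive:
    with archive.open('train.csv') as member:
        df = pd.read_csv(member)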
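
The same idea can be tried without any files on disk. The sketch below uses only the standard library: it builds a small archive in memory and reads the CSV member back without extracting it. The member name "train.csv" follows the example above, but the two data rows are invented for illustration.

```python
import csv
import io
import zipfile

# Build a tiny archive in memory so the example runs without files on disk.
# The rows below are made-up sample data, not the real train.csv.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.csv", "id,amount\n1,100\n2,250\n")

# Reopen the archive for reading and parse the CSV member directly.
buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as archive:
    with archive.open("train.csv") as member:
        # archive.open() yields bytes, so wrap it to decode text for csv.
        text_stream = io.TextIOWrapper(member, encoding="utf-8", newline="")
        rows = list(csv.DictReader(text_stream))

print(rows)
```

Each row comes back as a dictionary of strings keyed by the header line; numeric conversion is left to the caller.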

## Plain Text (txt) file format

In [None]:
# The with block closes the file automatically, even if reading fails
with open("text.txt", "r") as text_file:
    lines = text_file.read()

## JSON file format

The JSON file format can be read easily in any programming language because it is a language-independent data format.

In [None]:
import pandas as pd
df = pd.read_json("/home/kunal/Downloads/Loan_Prediction/train.json")
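
To see what language-independent means in practice, here is a minimal sketch using the standard-library json module instead of pandas. The record and its field names are hypothetical, chosen only to illustrate how JSON types map onto Python types.

```python
import json

# A made-up record standing in for one row of a JSON data set.
text = '{"Loan_ID": "LP001", "ApplicantIncome": 5849, "Married": false}'

record = json.loads(text)           # JSON text -> Python dict
print(record["ApplicantIncome"])    # JSON numbers become int or float
print(record["Married"])            # JSON false becomes Python False

round_trip = json.dumps(record, sort_keys=True)  # Python dict -> JSON text
print(round_trip)
```

Any other language with a JSON parser could read the same text, which is why the format is popular for exchanging data between systems.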

## XML file format

XML stands for Extensible Markup Language. As the name suggests, it is a markup language with a set of rules for encoding data. The format is both human-readable and machine-readable. XML is self-descriptive and was designed for sending information over the internet. It resembles HTML, but its tags are defined by the document author rather than predefined.

In [None]:
import xml.etree.ElementTree as ET
tree = ET.parse('/home/sunilray/Desktop/2 sigma/train.xml')
root = tree.getroot()
print(root.tag)
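
Once the root element is available, child elements and attributes can be walked in the same way. The sketch below parses an inline string rather than a file; the element and attribute names are invented for illustration and do not come from the train.xml file above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML snippet; the tags and values are made up.
xml_text = """
<loans>
    <loan id="LP001"><amount>128</amount></loan>
    <loan id="LP002"><amount>66</amount></loan>
</loans>
""".strip()

root = ET.fromstring(xml_text)   # parse from a string instead of a file
print(root.tag)

# Collect each loan's id attribute and the integer value of its <amount> child.
amounts = {}
for loan in root.findall("loan"):
    amounts[loan.get("id")] = int(loan.find("amount").text)

print(amounts)
```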

## HTML files

Information is scraped from [here](https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/)

HTML stands for HyperText Markup Language. It is the standard markup language for creating web pages and describes their structure using markup. HTML tags look like XML tags, but they are predefined.

Reading data out of an HTML page is referred to as web scraping. Two modules are used here for scraping data:
 * urllib: A Python module for fetching URLs (named urllib2 in Python 2, urllib.request in Python 3). It defines functions and classes to help with URL actions (basic and digest authentication, redirections, cookies, etc.)
 * BeautifulSoup: A tool for pulling information out of a web page. You can use it to extract tables, lists, and paragraphs, and you can apply filters to narrow down what is extracted

### Basics - Get familiar with HTML tags

While performing web scraping, we deal with HTML tags, so we need a good understanding of them. 
If you already know the basics of HTML, you can skip this section. Below is the basic syntax of HTML: 

In [None]:
<!DOCTYPE html>
<html>
    <body>
        <h1> My First Heading </h1>
        <p> My First paragraph </p>
    </body>
</html>

This syntax has various tags as elaborated below:

    * <!DOCTYPE html> : HTML documents must start with a type declaration
    * HTML document is contained between <html> and </html>
    * The visible part of the HTML document is between <body> and </body>
    * HTML headings are defined with the <h1> to <h6> tags
    * HTML paragraphs are defined with the <p> tag

Other useful HTML tags
 1. HTML links are defined with the <a> tag, e.g. <a href="http://www.test.com">This is a link for test.com</a>
 2. HTML tables are defined with <table>, rows with <tr>, and rows are divided into data cells with <td>
        <table style="width:100%">
            <tr>
                <td> Jill </td>
                <td> Smith </td>
                <td> 50 </td>
            </tr>
            <tr>
                <td> Eve </td>
                <td> Jackson </td>
                <td> 94 </td>
            </tr>
        </table>
 3. HTML lists start with <ul> (unordered) or <ol> (ordered). Each list item starts with <li>

### Scraping a webpage using Beautiful Soup

#### 1. Import necessary libraries

#import the library used to query a website
import urllib.request #in Python 2 this module was called urllib2
#specify the url
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
#Query the website and return the html to the variable 'page'
page = urllib.request.urlopen(wiki) #in Python 2 use urllib2.urlopen(wiki)
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable with Python's built-in parser, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "html.parser")

#### 2. Use function "prettify" to look at nested structure of HTML page

print(soup.prettify())

#### 3. Work with HTML tags

 * soup.<tag> : Return content between opening and closing tag including tag

 * soup.<tag>.string : Return string within given tag.

In [None]:
 * Find all the links within the page's <a> tags: We know that a link is marked up with the <a> tag.  
So we can start with soup.a, which returns only the first link available in the web page.

In [None]:
 * To extract all the links within <a>, we use find_all(), e.g. soup.find_all("a")

In [None]:
 * To show only the links, we iterate over each <a> tag and return the link using the attribute "href" with link.get
         # all_links = soup.find_all("a")
         # for link in all_links:
         #      print(link.get("href"))
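
The tag accessors and the link extraction described above can be tried on a small inline page, with no network access. This is a minimal sketch: the HTML string and the example.com URLs are placeholders, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, kept inline so the example runs offline.
html = """
<html><head><title>Capitals</title></head>
<body>
<a href="https://example.com/one">One</a>
<a href="https://example.com/two">Two</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title)          # the whole tag, including <title> and </title>
print(soup.title.string)   # only the text inside the tag
print(soup.a)              # the first <a> tag only

# find_all returns every <a> tag; link.get("href") pulls out the URL.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)
```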

In [None]:
 * Find the right table
         # all_tables = soup.find_all('table')
To identify the right table, we use the table's "class" attribute as a filter. 
In Chrome, you can check the class name by right-clicking the required table on the web page 
-> Inspect element -> Copy the class name, or by going through the output of the above command to find the class name of the right table.
         # right_table = soup.find('table', class_='wikitable sortable plainrowheaders')

In [None]:
 * Extract the information to DataFrame
    Here, we need to iterate through each row (tr) 
    and then assign each element of tr (td) to a variable and append it to a list
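
The row-by-row extraction can be sketched as follows. The table below is a tiny hand-written stand-in for the Wikipedia table (its class name and two rows are assumptions for illustration), and the result is collected into plain lists so that only bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page.
html = """
<table class="wikitable">
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Goa</td><td>Panaji</td></tr>
  <tr><td>Kerala</td><td>Thiruvananthapuram</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
right_table = soup.find("table", class_="wikitable")

states, capitals = [], []
for row in right_table.find_all("tr"):
    cells = row.find_all("td")
    # The header row contains <th> cells only, so it has no <td> and is skipped.
    if len(cells) == 2:
        states.append(cells[0].get_text(strip=True))
        capitals.append(cells[1].get_text(strip=True))

table = {"State": states, "Capital": capitals}
print(table)
```

Passing a dictionary of equal-length lists like `table` to `pd.DataFrame` would then produce the DataFrame, with one column per key.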

## Image file

Image files are probably the most fascinating file format used in data science. Any computer vision application is based on image processing. So it is necessary to know different image file formats.

Typical image files are 3-dimensional, holding RGB values, but they can also be 2-dimensional (grayscale) or 4-dimensional (having intensity). An image consists of pixels and the metadata associated with it.

Each image consists of one or more frames of pixels, and each frame is a two-dimensional array of pixel values. Pixel values can be of any intensity. Metadata associated with an image can be the image type (.png) or the pixel dimensions.

In [None]:
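from scipy import datasets  # misc.face() and misc.imsave() are no longer available in recent SciPy releases
import matplotlib.pyplot as plt
f = datasets.face()  # sample image as a NumPy array; needs the optional pooch package
plt.imsave('face.png', f)  # save the array as a PNG file
plt.imshow(f)
plt.show()
type(f), f.shape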
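
To make the array shapes concrete, the sketch below builds blank images with NumPy instead of loading a file. It assumes the common height x width x channels layout, in which a colour image is a 3-D array and an extra alpha channel is stored as a fourth entry on the last axis; the 4 x 6 size is arbitrary.

```python
import numpy as np

gray = np.zeros((4, 6), dtype=np.uint8)      # height x width
rgb = np.zeros((4, 6, 3), dtype=np.uint8)    # height x width x (R, G, B)
rgba = np.zeros((4, 6, 4), dtype=np.uint8)   # adds an alpha channel

rgb[0, 0] = [255, 0, 0]   # paint the top-left pixel pure red

# A naive grayscale conversion: average the three colour channels.
as_gray = rgb.mean(axis=2)

print(gray.ndim, rgb.ndim, rgba.shape)
print(as_gray.shape, as_gray[0, 0])
```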

## Hierarchical Data Format (HDF)

In Hierarchical Data Format (HDF), you can store a large amount of data easily. It is used not only for storing high volumes of complex data but also for small volumes of simple data.

The advantages of using HDF are:

    * It can be used in every size and type of system
    * It has flexible, efficient storage and fast I/O
    * Many formats support HDF

There are multiple HDF formats. HDF5 is the latest version, designed to address some of the limitations of the older HDF file formats. HDF5 has some similarity with XML: like XML, HDF5 files are self-describing and allow users to specify complex data relationships and dependencies.

In [None]:
import pandas as pd
t = pd.read_hdf('train.h5')

## PDF file format

In [None]:
Several libraries do a good job of parsing PDF files; one of them is PDFMiner. To read a PDF file through PDFMiner, you have to:
    * Install PDFMiner
    * Run pdf2txt.py <pdf_file>.pdf

## DOCX format

In [None]:
# Install the package first, e.g. from a notebook cell: !pip install docx2txt
import docx2txt
text = docx2txt.process("file.docx")

## MP3 file format

## MP4 file format

## SQL to Pandas DataFrame

### Create a table in test_database (the database file created in git) with the code below

In [2]:
import sqlite3

conn = sqlite3.connect('test_database') 
c = conn.cursor()

c.execute('''
          CREATE TABLE IF NOT EXISTS products
          ([product_id] INTEGER PRIMARY KEY, [product_name] TEXT, [price] INTEGER)
          ''')
          
c.execute('''
          INSERT INTO products (product_id, product_name, price)

                VALUES
                (1,'Computer',800),
                (2,'Printer',200),
                (3,'Tablet',300),
                (4,'Desk',450),
                (5,'Chair',150)
          ''')                     

conn.commit()

### Get from SQL to Pandas DataFrame

In [5]:
import sqlite3
import pandas as pd

conn = sqlite3.connect('test_database') 
          
# read_sql_query already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = pd.read_sql_query('''
                       SELECT
                       *
                       FROM products
                       ''', conn)

print(df)
max_price = df.price.max()
print(max_price)

   product_id product_name  price
0           1     Computer    800
1           2      Printer    200
2           3       Tablet    300
3           4         Desk    450
4           5        Chair    150
800
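
A variant that may be handy while experimenting: the sketch below uses an in-memory SQLite database, so it can be rerun freely without the INSERT colliding with primary keys left over from an earlier run. It uses only the standard-library sqlite3 module; the three rows are a subset of the sample data above, and the 500 price threshold is an arbitrary value for illustration.

```python
import sqlite3

# ":memory:" creates a throwaway database that disappears when closed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products "
    "(product_id INTEGER PRIMARY KEY, product_name TEXT, price INTEGER)"
)

# executemany with ? placeholders inserts several rows safely.
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Computer", 800), (2, "Printer", 200), (3, "Tablet", 300)],
)
conn.commit()

# Let the database compute the maximum instead of loading every row.
max_price = conn.execute("SELECT MAX(price) FROM products").fetchone()[0]

# Parameterised filter: names of products cheaper than the threshold.
cheap = conn.execute(
    "SELECT product_name FROM products WHERE price < ? ORDER BY price",
    (500,),
).fetchall()
conn.close()

print(max_price)
print(cheap)
```

The same connection object could also be passed to `pd.read_sql_query` when a DataFrame is needed.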
