# PYTHON-PANDAS: INPUT DATA INTO PYTHON WITH PANDAS

By: Hector Alvaro Rojas &nbsp;&nbsp;|&nbsp;&nbsp; Data Science, Visualizations and Applied Statistics &nbsp;&nbsp;|&nbsp;&nbsp; January 16, 2018<br>
    Url: [http://www.arqmain.net]   &nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;   GitHub: [https://github.com/arqmain]
    <hr>

## I INTRODUCTION

In any project that analyzes or models real data, a fundamental first step is importing the data into the Python ecosystem. Knowing how to do this for the most common file formats is therefore essential.

This project therefore covers some of the most frequently used functions and methods of the pandas library for importing data into Python. If a format that you consider important is missing, please look for it on the web or in the Python documentation. Still, not knowing how to import data in the formats presented here can be a serious professional shortcoming.

In another project, I will cover how to import data from the most common data management platforms like MS-SQL, MySQL, and others. 

## Table of Contents
><b>I INTRODUCTION</b><br>
><b>II IMPORT REQUIRED PACKAGES</b><br>
><b>III INPUTTING DATA WITH PANDAS</b><br>
>* <b>1 Input data by using List</b><br>
>* <b>2 Input data as Pandas dataframe directly by using pandas</b><br>
>* <b>3 Import data in txt format</b><br>
	><i>31 Using read_table()</i><br>
	><i>32 Using read_csv()</i><br>
>* <b>4 Import data in csv format</b><br>
>* <b>5 Import data in Excel format</b><br>
>* <b>6 Import data in JSON format</b><br>
>* <b>7 Import data in ZIP format</b><br>
>* <b>8 Import data in SAS format [sas7bdat]</b><br>
>* <b>9 Import data in STATA format</b><br>
>* <b>10 Import data in XML format</b><br>

## II IMPORT REQUIRED PACKAGES

In [38]:
import os
import pandas as pd
import numpy as np
import zipfile
import xml.etree.ElementTree as et
from lxml import objectify

## III INPUTTING DATA WITH PANDAS

## 1 Input data by using List

Remember that a list is a sequence of multiple values [See: Item 322 Python List of my project http://nbviewer.jupyter.org/github/arqmain/Python/blob/master/Python/Project1/PYTHON-Project1_Generalities_and_Introduction.ipynb]. It can store different types of data, such as integers, floats, and strings.

A list is not a pandas object but a built-in Python type; we use it here as the most basic way to input data into Python.


In [7]:
# Example:
Numbers = [-4,-5.8,-1000,13.23,100,1000]
print("Numbers is a List object that contains the numbers ", Numbers)

Lett = ['A', 'B', 'C', 'Best', 'Manon']
print("Lett is a List object that contains the strings" ,Lett)

('Numbers is a List object that contains the numbers ', [-4, -5.8, -1000, 13.23, 100, 1000])
('Lett is a List object that contains the strings', ['A', 'B', 'C', 'Best', 'Manon'])
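
A list becomes useful to pandas once it is wrapped in a Series or a DataFrame. The following is a minimal sketch that reuses the two lists from the example above; the names `value` and `label` are made up for illustration and are not part of the original example.

```python
import pandas as pd

numbers = [-4, -5.8, -1000, 13.23, 100, 1000]
lett = ['A', 'B', 'C', 'Best', 'Manon']

# A single list maps naturally onto a Series (one column of values).
s = pd.Series(numbers, name='value')

# A list can also become a one-column DataFrame; 'label' is an illustrative name.
df_lett = pd.DataFrame({'label': lett})

print(s.dtype)        # integers mixed with floats are stored as floats
print(df_lett.shape)  # five rows, one column
```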


## 2 Input data as Pandas dataframe directly by using pandas

We can build a dataframe with the DataFrame() constructor of the pandas package, here from a dictionary that maps column names to lists of values.

In [2]:
# Example:
mydata = {'col1': ['a', 'b', 'x', 'y', 'z'],
	'col2': ['1', '2', '3', '4', '5'],
	'col3': [4.2, 0.03, 1.5, 2.5, 38],
	's': [1, 2, 4.7, 'pandas', 10],
	'cost' : [1020, 1625.2, 1204,'', 1020],
	'sports' : ['strongly agree', 'disagree', 'agree','strongly disagree', 'agree'],
	'politics' : [3, 0, 2, 2, 4] }
df = pd.DataFrame(mydata)
df.head(5)

Unnamed: 0,col1,col2,col3,cost,politics,s,sports
0,a,1,4.2,1020.0,3,1,strongly agree
1,b,2,0.03,1625.2,0,2,disagree
2,x,3,1.5,1204.0,2,4.7,agree
3,y,4,2.5,,2,pandas,strongly disagree
4,z,5,38.0,1020.0,4,10,agree
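
In the dataframe above, col2 holds digits typed as strings and cost contains an empty string, so neither column can be used for arithmetic as it stands. The sketch below, which assumes only a subset of the columns from the example, shows one possible way to convert them with pd.to_numeric().

```python
import pandas as pd

# Subset of the columns used in the example above.
mydata = {'col2': ['1', '2', '3', '4', '5'],
          'col3': [4.2, 0.03, 1.5, 2.5, 38],
          'cost': [1020, 1625.2, 1204, '', 1020]}
df = pd.DataFrame(mydata)
print(df.dtypes)  # col2 and cost are not numeric yet

# Strings made of digits convert cleanly.
df['col2'] = pd.to_numeric(df['col2'])

# errors='coerce' turns values that are not numbers (here the empty string) into NaN.
df['cost'] = pd.to_numeric(df['cost'], errors='coerce')
print(df.dtypes)
```

Whether a missing cost should become NaN or some other value depends on the analysis; errors='coerce' is only one option.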


## 3 Import data in txt format

First of all, get to know your working directory:

In [5]:
os.getcwd()

'C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2'
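
The examples in this section use absolute Windows paths with doubled backslashes. As a hedged alternative, the sketch below builds a path from the working directory and shows that a raw string is equivalent to the escaped form. The shortened folder in the raw string is hypothetical and only serves the comparison.

```python
import os

# Assumption: the data file sits in the current working directory.
data_dir = os.getcwd()
path = os.path.join(data_dir, "car_data.txt")
print(os.path.basename(path))

# A raw string (r"...") avoids doubling every backslash in a Windows path.
# The folder below is a shortened, hypothetical location.
raw = r"C:\Users\Alvaro\Documents\car_data.txt"
escaped = "C:\\Users\\Alvaro\\Documents\\car_data.txt"
print(raw == escaped)
```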

### 31 Using read_table()

#### 311 Data separated by SPACE (' ')

In [8]:
#Example 1:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

df = pd.read_table(path, sep =r'\s+') # Split on one or more whitespace characters; the first line of the file becomes the header
df.head(5)


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [9]:
#Example 2:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

df = pd.read_table(path, sep =' ')
df.head(5)

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
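
Examples 1 and 2 return the same result here, but the two separators are not interchangeable in general: a single space splits on every space, while the regular expression \s+ treats a run of whitespace as one separator. The sketch below uses a made-up two-column text, padded with two spaces, to illustrate the difference.

```python
import io
import pandas as pd

# Made-up sample: fields are separated by two spaces on every line.
text = "a  b\n1  2\n3  4\n"

# A single space splits twice per line, producing an empty middle column.
single = pd.read_csv(io.StringIO(text), sep=' ')

# r'\s+' collapses each run of whitespace into one separator.
regex = pd.read_csv(io.StringIO(text), sep=r'\s+')

print(single.shape)
print(regex.shape)
```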


In [10]:
#Example 3:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

df = pd.read_table(path, sep =r'\s+', header=None)
df.head(5)

Unnamed: 0,0,1,2,3,4,5,6
0,buying,maint,doors,persons,lug_boot,safety,class
1,vhigh,vhigh,2,2,small,low,unacc
2,vhigh,vhigh,2,2,small,med,unacc
3,vhigh,vhigh,2,2,small,high,unacc
4,vhigh,vhigh,2,2,med,low,unacc


In [11]:
#Example 4: Importing the dataset partially
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

df = pd.read_table(path, sep =r'\s+', nrows=80, skiprows=(1,2,5,10,60), usecols=(1,3,5))
print("df.head(5)\n",df.head(5))
print("Length of df = ",len(df))

df.head(5)
    maint persons safety
0  vhigh       2   high
1  vhigh       2    low
2  vhigh       2   high
3  vhigh       2    low
4  vhigh       2    med
Length of df =  80
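
To make the interaction of skiprows, nrows and usecols easier to follow, here is a small sketch on a made-up text with numbered rows. The data and column names are invented for illustration only.

```python
import io
import pandas as pd

# Made-up file: a header line followed by ten numbered rows.
lines = ["id a b c"] + ["{0} a{0} b{0} c{0}".format(i) for i in range(1, 11)]
text = "\n".join(lines)

df = pd.read_csv(io.StringIO(text), sep=r'\s+',
                 skiprows=[1, 2],   # drop file lines 1 and 2 (line 0 is the header)
                 nrows=4,           # then keep only the first four remaining rows
                 usecols=[0, 2])    # keep the 1st and 3rd columns by position
print(df)
```

Note that skiprows counts lines of the file, so line 0 is the header, and nrows is applied after the skipped lines have been removed.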


In [12]:
#Example 5: Specify dot (.) values as missing values
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

df = pd.read_table(path, sep =r'\s+', na_values=['.'])
df.head(5)

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [7]:
# Tell pandas to look for rows with at least one missing value
df[df.isnull().any(axis=1)]

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
864,,vhigh,2,2,small,low,unacc
870,,vhigh,2,2,big,low,unacc
882,,vhigh,2,more,small,low,unacc
996,med,,2,more,big,low,unacc
1007,med,high,3,2,big,,unacc
1020,med,high,3,more,med,,unacc
1075,med,,5more,more,med,med,acc


In [13]:
#Example 6: Load a txt file while specifying column names
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

mydata = pd.read_table(path, sep =r'\s+', na_values=['.'], names=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'class'])
mydata.head(5)


Unnamed: 0,x1,x2,x3,x4,x5,x6,class
0,buying,maint,doors,persons,lug_boot,safety,class
1,vhigh,vhigh,2,2,small,low,unacc
2,vhigh,vhigh,2,2,small,med,unacc
3,vhigh,vhigh,2,2,small,high,unacc
4,vhigh,vhigh,2,2,med,low,unacc
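
In the output above, the original header line (buying, maint, ...) ends up as data row 0, because passing names alone tells pandas that the file has no header. If the goal is to replace the existing header instead, adding header=0 is one way to do it. The sketch below shows both behaviours on a small made-up text.

```python
import io
import pandas as pd

# Made-up sample with a header line and two data rows.
text = "buying maint doors\nvhigh vhigh 2\nhigh med 3\n"
new_names = ['x1', 'x2', 'x3']

# names only: the file's own header line is kept as the first data row.
kept = pd.read_csv(io.StringIO(text), sep=r'\s+', names=new_names)

# names plus header=0: the file's header line is discarded and replaced.
replaced = pd.read_csv(io.StringIO(text), sep=r'\s+', names=new_names, header=0)

print(kept)
print(replaced)
```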


In [9]:
#Example 7: Import File from URL
path = "https://raw.githubusercontent.com/arqmain/Python/master/Pandas/Project2/car_data.txt"

df = pd.read_table(path, sep =r'\s+', na_values=['.'])
df.head(5)

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


#### 312 Data separated by TAB ('\t')

In [61]:
#Example1:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\adult.data.TAB.txt"
mydata = pd.read_table(path, sep= '\t')
mydata.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [10]:
#Example2: Import File from URL
path = "https://raw.githubusercontent.com/arqmain/Python/master/Pandas/Project2/adult.data.TAB.txt"
mydata = pd.read_table(path, sep= '\t')
mydata.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


For more documentation see pandas.read_table [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html]

### 32 Using read_csv()

#### 321 Data separated by SPACE (' ')

In [14]:
# Example 1:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

mydata  = pd.read_csv(path, sep =' ')
mydata.head(5)

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [8]:
# Example 2:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

mydata  = pd.read_csv(path, sep =r"\s+")
mydata.head(5)



Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [15]:
#Example 3:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

mydata = pd.read_csv(path, sep =r'\s+', header=None)
mydata.head(5)

Unnamed: 0,0,1,2,3,4,5,6
0,buying,maint,doors,persons,lug_boot,safety,class
1,vhigh,vhigh,2,2,small,low,unacc
2,vhigh,vhigh,2,2,small,med,unacc
3,vhigh,vhigh,2,2,small,high,unacc
4,vhigh,vhigh,2,2,med,low,unacc


In [16]:
#Example 4: Importing the dataset partially
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

mydata = pd.read_csv(path, sep =r'\s+', nrows=80, skiprows=(1,2,5,10,60), usecols=(1,3,5))
print("mydata.head(5)\n",mydata.head(5))
print("Length of mydata = ",len(mydata))

mydata.head(5)
    maint persons safety
0  vhigh       2   high
1  vhigh       2    low
2  vhigh       2   high
3  vhigh       2    low
4  vhigh       2    med
Length of mydata =  80


In [17]:
#Example 5: Specify dot (.) values as missing values
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

mydata = pd.read_csv(path, sep =r'\s+', na_values=['.'])
mydata.head(5)

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [44]:
# Look for rows with at least one missing value
df2 = mydata
df2[df2.isnull().any(axis=1)]

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
864,,vhigh,2,2,small,low,unacc
870,,vhigh,2,2,big,low,unacc
882,,vhigh,2,more,small,low,unacc
996,med,,2,more,big,low,unacc
1007,med,high,3,2,big,,unacc
1020,med,high,3,more,med,,unacc
1075,med,,5more,more,med,med,acc


In [18]:
#Example 6: Load a csv while specifying column names
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\car_data.txt"

mydata = pd.read_csv(path, sep =r'\s+', na_values=['.'], names=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'class'])
mydata.head(5)

Unnamed: 0,x1,x2,x3,x4,x5,x6,class
0,buying,maint,doors,persons,lug_boot,safety,class
1,vhigh,vhigh,2,2,small,low,unacc
2,vhigh,vhigh,2,2,small,med,unacc
3,vhigh,vhigh,2,2,small,high,unacc
4,vhigh,vhigh,2,2,med,low,unacc


In [11]:
#Example 7: Import File from URL
path = "https://raw.githubusercontent.com/arqmain/Python/master/Pandas/Project2/car_data.txt"

df = pd.read_csv(path, sep =r'\s+', na_values=['.'])
df.head(5)

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


#### 322 Data separated by TAB ('\t')

In [19]:
#Example1:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\adult.data.TAB.txt"
mydata = pd.read_csv(path, sep= '\t')
mydata.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [12]:
#Example2: Import File from URL
path = "https://raw.githubusercontent.com/arqmain/Python/master/Pandas/Project2/adult.data.TAB.txt"
mydata = pd.read_csv(path, sep= '\t')
mydata.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


For more documentation see pandas.read_csv [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html]
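
As the parallel examples in sections 31 and 32 suggest, read_table() and read_csv() accept the same arguments; the practical difference is the default separator (a tab for read_table(), a comma for read_csv()). A minimal sketch on a made-up tab-separated text:

```python
import io
import pandas as pd

# Made-up tab-separated sample.
text = "age\tworkclass\n39\tState-gov\n50\tPrivate\n"

via_table = pd.read_table(io.StringIO(text))        # tab is the default separator
via_csv = pd.read_csv(io.StringIO(text), sep='\t')  # tab must be requested explicitly

print(via_table.equals(via_csv))
```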

## 4 Import data in csv format

In [20]:
#Example 1: 
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\adult.data.csv"

df  = pd.read_csv(path, sep =',') # sep=',' is the default, so pd.read_csv(path) works too.
df.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [24]:
#Example 2: If no header (title) in raw data file
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\adult.data2.csv"

mydata = pd.read_csv(path, sep =',', header=None)
mydata.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [28]:
#Example 3: Add Column Names while loading
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\adult.data2.csv"

names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship',\
         'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
df = pd.read_csv(path, sep =',', header=None, names = names)
df.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [39]:
#Example 4: Add Column Names after loading
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\adult.data2.csv"

names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship',\
         'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
df = pd.read_csv(path, sep =',', header=None)
print("df without column names\n",df.head(5))

print("")
print("=======================================================================================================")
print("")

df.columns = names
print("df with column names\n",df.head(5))

df without column names
    0                  1       2           3   4                    5   \
0  39          State-gov   77516   Bachelors  13        Never-married   
1  50   Self-emp-not-inc   83311   Bachelors  13   Married-civ-spouse   
2  38            Private  215646     HS-grad   9             Divorced   
3  53            Private  234721        11th   7   Married-civ-spouse   
4  28            Private  338409   Bachelors  13   Married-civ-spouse   

                   6               7       8        9     10  11  12  \
0        Adm-clerical   Not-in-family   White     Male  2174   0  40   
1     Exec-managerial         Husband   White     Male     0   0  13   
2   Handlers-cleaners   Not-in-family   White     Male     0   0  40   
3   Handlers-cleaners         Husband   Black     Male     0   0  40   
4      Prof-specialty            Wife   Black   Female     0   0  40   

               13      14  
0   United-States   <=50K  
1   United-States   <=50K  
2   United-States  

In [4]:
#Example 5: Importing the dataset partially
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\adult.data.csv"

df = pd.read_csv(path, sep =',', nrows=1200, skiprows=(1,2,5,10,60), usecols=(1,3,5,8,12))
print("df.head(5)\n",df.head(5))
print("Length of df = ",len(df))

df.head(5)
            workclass education          marital-status    race  hours-per-week
0            Private   HS-grad                Divorced   White              40
1            Private      11th      Married-civ-spouse   Black              40
2            Private   Masters      Married-civ-spouse   White              40
3            Private       9th   Married-spouse-absent   Black              16
4   Self-emp-not-inc   HS-grad      Married-civ-spouse   White              45
Length of df =  1200


In [2]:
#Example 6: Treat the question mark (' ?', including its leading space) as a missing value
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\adult.data.csv"

df = pd.read_csv(path, sep =',', na_values=[' ?'])
df.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [18]:
# Tell pandas to look for rows with at least one missing value
df[df.isnull().any(axis=1)]

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
14,40,Private,121772,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0,0,40,,>50K
27,54,,180211,Some-college,10,Married-civ-spouse,,Husband,Asian-Pac-Islander,Male,0,0,60,South,>50K
38,31,Private,84154,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,0,0,38,,>50K
51,18,Private,226956,HS-grad,9,Never-married,Other-service,Own-child,White,Female,0,0,30,,<=50K
61,32,,293936,7th-8th,4,Married-spouse-absent,,Not-in-family,White,Male,0,0,40,,<=50K
69,25,,200681,Some-college,10,Never-married,,Own-child,White,Male,0,0,40,United-States,<=50K
77,67,,212759,10th,6,Married-civ-spouse,,Husband,White,Male,0,0,2,United-States,<=50K
93,30,Private,117747,HS-grad,9,Married-civ-spouse,Sales,Wife,Asian-Pac-Islander,Female,0,1573,35,,<=50K
106,17,,304873,10th,6,Never-married,,Own-child,White,Female,34095,0,32,United-States,<=50K
128,35,,129305,HS-grad,9,Married-civ-spouse,,Husband,White,Male,0,0,40,United-States,<=50K


In [13]:
#Example 7: Import File from URL
path = "https://raw.githubusercontent.com/arqmain/Python/master/Pandas/Project2/adult.data.csv"

df = pd.read_csv(path, sep =',', na_values=[' ?'])
df.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


For more documentation see pandas.read_csv [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html]


## 5 Import data in Excel format

Reading Excel files is very similar to reading CSV files. By default, the first sheet of the Excel file is read.

In [3]:
#Example1:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\Sample_Superstore_Sales.xlsx"

df = pd.read_excel(path)
df.head(5)

Unnamed: 0,Category,City,Container,Customer,Customer_Segment,Customer_Zip_Code,Last_N_days,Order_Date,Order_ID,Order_Priority,...,Number_of_Records,Order_Quantity,Product_Base_Margin,Profit,Profit_Ratio,Row_ID,Sales,Shipping_Cost,Time_to_Ship,Unit_Price
0,Technology,Louisville,Small Pack,Sanjit Jacobs,Consumer,80027,1778,2011-09-07,52128,Not Specified,...,1,11,0.61,-28,-2.0459,7309,14,3,1,1
1,Technology,Danville,Small Pack,Katherine Hughes,Consumer,24541,1204,2013-04-03,59680,Critical,...,1,15,0.61,-38,-2.1199,8347,18,3,1,1
2,Office Supplies,Thornton,Wrap Bag,Christopher Martinez,Small Business,80229,1871,2011-06-06,31204,Low,...,1,8,0.38,-1,-0.0851,4380,10,1,0,1
3,Office Supplies,Casas Adobes,Wrap Bag,Quincy Jones,Corporate,85704,1284,2013-01-13,55299,Low,...,1,20,0.38,-3,-0.1194,7717,22,1,0,1
4,Office Supplies,Eagle,Wrap Bag,Darren Powers,Corporate,83616,2436,2009-11-18,38529,Low,...,1,38,0.38,0,-0.0109,5421,45,1,0,1


In [7]:
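#Example2: Selecting a sheet by name with the sheet_name parameter
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\Sample_Superstore_Sales.xlsx"

df = pd.read_excel(path, sheet_name = 'Superstore2') # sheet_name replaces the older sheetname keyword (pandas 0.21 and later)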
df.head(5)

Unnamed: 0,Category,City,Container,Customer,Customer_Segment,Customer_Zip_Code,Last_N_days
0,Technology,Louisville,Small Pack,Sanjit Jacobs,Consumer,80027,1778
1,Technology,Danville,Small Pack,Katherine Hughes,Consumer,24541,1204
2,Office Supplies,Thornton,Wrap Bag,Christopher Martinez,Small Business,80229,1871
3,Office Supplies,Casas Adobes,Wrap Bag,Quincy Jones,Corporate,85704,1284
4,Office Supplies,Eagle,Wrap Bag,Darren Powers,Corporate,83616,2436


If you aren't sure of the sheet names, you can select sheets by their position. Note that sheet positions start at 0 (like indices in pandas), not at 1.

In [9]:
#Example3: Selecting a sheet by its position with the sheet_name parameter
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\Sample_Superstore_Sales.xlsx"

df = pd.read_excel(path, sheet_name = 1) # 1 is the second sheet, because positions start at 0
df.head(5)

Unnamed: 0,Category,City,Container,Customer,Customer_Segment,Customer_Zip_Code,Last_N_days
0,Technology,Louisville,Small Pack,Sanjit Jacobs,Consumer,80027,1778
1,Technology,Danville,Small Pack,Katherine Hughes,Consumer,24541,1204
2,Office Supplies,Thornton,Wrap Bag,Christopher Martinez,Small Business,80229,1871
3,Office Supplies,Casas Adobes,Wrap Bag,Quincy Jones,Corporate,85704,1284
4,Office Supplies,Eagle,Wrap Bag,Darren Powers,Corporate,83616,2436
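
When the sheet names are unknown, another option is to load every sheet at once: passing sheet_name=None to read_excel() returns a dictionary keyed by sheet name. The sketch below wraps this in a helper called read_all_sheets, which is a hypothetical name, and assumes the workbook sits in the working directory and that an Excel engine such as openpyxl is installed.

```python
import os
import pandas as pd

def read_all_sheets(path):
    """Return a dict that maps each sheet name to its DataFrame."""
    # sheet_name=None asks read_excel for every sheet in the workbook.
    return pd.read_excel(path, sheet_name=None)

# Assumed location; adjust to where the workbook actually lives.
path = os.path.join(os.getcwd(), "Sample_Superstore_Sales.xlsx")

# Only attempt the read if the file is really there.
if os.path.exists(path):
    sheets = read_all_sheets(path)
    for name, frame in sheets.items():
        print(name, frame.shape)
```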


In [16]:
#Example 4: Add Column Names after loading
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\Sample_Superstore_Sales.xlsx"

names = ['State', 'Sub-Category', 'Discount', 'Number_of_Records', 'Order_Quantity', 'Product_Base_Margin', 'Profit',\
	'Profit_Ratio', 'Row_ID', 'Sales', 'Shipping_Cost', 'Time_to_Ship', 'Unit_Price']

df = pd.read_excel(path, sheet_name = 'Superstore3', header=None)
print("df without column names\n",df.head(5))

print("")
print("=======================================================================================================")
print("")

df.columns = names
print("df with column names\n",df.head(5))

df without column names
          0                     1     2   3   4     5   6       7     8   9   \
0  Colorado  Computer Peripherals  0.03   1  11  0.61 -28 -2.0459  7309  14   
1  Virginia  Computer Peripherals  0.05   1  15  0.61 -38 -2.1199  8347  18   
2  Colorado          Rubber Bands  0.00   1   8  0.38  -1 -0.0851  4380  10   
3   Arizona          Rubber Bands  0.09   1  20  0.38  -3 -0.1194  7717  22   
4     Idaho          Rubber Bands  0.02   1  38  0.38   0 -0.0109  5421  45   

   10  11  12  
0   3   1   1  
1   3   1   1  
2   1   0   1  
3   1   0   1  
4   1   0   1  


df with column names
       State          Sub-Category  Discount  Number_of_Records  \
0  Colorado  Computer Peripherals      0.03                  1   
1  Virginia  Computer Peripherals      0.05                  1   
2  Colorado          Rubber Bands      0.00                  1   
3   Arizona          Rubber Bands      0.09                  1   
4     Idaho          Rubber Bands      0.02       

In [14]:
#Example 5: Import File from URL
path = "https://raw.githubusercontent.com/arqmain/Python/master/Pandas/Project2/Sample_Superstore_Sales.xlsx"

df = pd.read_excel(path)
df.head(5)

Unnamed: 0,Category,City,Container,Customer,Customer_Segment,Customer_Zip_Code,Last_N_days,Order_Date,Order_ID,Order_Priority,...,Number_of_Records,Order_Quantity,Product_Base_Margin,Profit,Profit_Ratio,Row_ID,Sales,Shipping_Cost,Time_to_Ship,Unit_Price
0,Technology,Louisville,Small Pack,Sanjit Jacobs,Consumer,80027,1778,2011-09-07,52128,Not Specified,...,1,11,0.61,-28,-2.0459,7309,14,3,1,1
1,Technology,Danville,Small Pack,Katherine Hughes,Consumer,24541,1204,2013-04-03,59680,Critical,...,1,15,0.61,-38,-2.1199,8347,18,3,1,1
2,Office Supplies,Thornton,Wrap Bag,Christopher Martinez,Small Business,80229,1871,2011-06-06,31204,Low,...,1,8,0.38,-1,-0.0851,4380,10,1,0,1
3,Office Supplies,Casas Adobes,Wrap Bag,Quincy Jones,Corporate,85704,1284,2013-01-13,55299,Low,...,1,20,0.38,-3,-0.1194,7717,22,1,0,1
4,Office Supplies,Eagle,Wrap Bag,Darren Powers,Corporate,83616,2436,2009-11-18,38529,Low,...,1,38,0.38,0,-0.0109,5421,45,1,0,1


For more documentation see pandas.read_excel [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html#pandas.read_excel]

## 6 Import data in JSON format

JSON stands for JavaScript Object Notation.

JSON provides a simple, human-readable syntax for exchanging data between software components and systems. It is generally less verbose and simpler to parse than XML, and most modern programming languages include JSON support in their standard libraries.

In [4]:
#Example1:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\mydata.json"
df = pd.read_json(path)
df.head(5)

Unnamed: 0,assignee,assignees,author_association,body,closed_at,comments,comments_url,created_at,events_url,html_url,...,locked,milestone,number,pull_request,repository_url,state,title,updated_at,url,user
0,,[],CONTRIBUTOR,closes #18324\r\n,NaT,2,https://api.github.com/repos/pandas-dev/pandas...,2018-01-15 13:39:04,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/19247,...,False,{'url': 'https://api.github.com/repos/pandas-d...,19247,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,DEPR: change Panel DeprecationWarning -> Futur...,2018-01-15 15:11:40,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'jreback', 'id': 953992, 'avatar_url..."
1,,[],NONE,- [x] closes #19171\r\n- [x] tests passed\r\n-...,NaT,0,https://api.github.com/repos/pandas-dev/pandas...,2018-01-15 12:42:02,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/19246,...,False,,19246,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,CLN: Refactor Index._validate_names(),2018-01-15 14:12:36,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'PoppyBagel', 'id': 34628304, 'avata..."
2,,[],NONE,- [X] closes #12509,NaT,1,https://api.github.com/repos/pandas-dev/pandas...,2018-01-15 06:00:54,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/19245,...,False,,19245,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,Doc: Adds example of categorical data for effi...,2018-01-15 14:00:41,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'pdpark', 'id': 16848166, 'avatar_ur..."
3,,[],CONTRIBUTOR,- [x] closes #19242 \r\n- [x] whatsnew entry\r\n,NaT,1,https://api.github.com/repos/pandas-dev/pandas...,2018-01-15 04:08:46,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/19244,...,False,,19244,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,BUG: unsupported type Interval when writing da...,2018-01-15 13:35:00,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'cbertinato', 'id': 20772838, 'avata..."
4,,[],CONTRIBUTOR,`Block` and `SparseBlock` each have a `reindex...,NaT,0,https://api.github.com/repos/pandas-dev/pandas...,2018-01-15 02:50:40,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/19243,...,False,,19243,,https://api.github.com/repos/pandas-dev/pandas,open,"core.internals checking `hasattr(item, 'reinde...",2018-01-15 02:50:40,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'jbrockmendel', 'id': 8078968, 'avat..."


In [7]:
#Example2: Import File from URL
path = "https://raw.githubusercontent.com/arqmain/Python/master/Pandas/Project2/mydata.json"
df = pd.read_json(path)
df.head(5)

Unnamed: 0,assignee,assignees,author_association,body,closed_at,comments,comments_url,created_at,events_url,html_url,...,locked,milestone,number,pull_request,repository_url,state,title,updated_at,url,user
0,,[],CONTRIBUTOR,closes #18324\r\n,NaT,2,https://api.github.com/repos/pandas-dev/pandas...,2018-01-15 13:39:04,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/19247,...,False,{'url': 'https://api.github.com/repos/pandas-d...,19247,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,DEPR: change Panel DeprecationWarning -> Futur...,2018-01-15 15:11:40,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'jreback', 'id': 953992, 'avatar_url..."
1,,[],NONE,- [x] closes #19171\r\n- [x] tests passed\r\n-...,NaT,0,https://api.github.com/repos/pandas-dev/pandas...,2018-01-15 12:42:02,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/19246,...,False,,19246,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,CLN: Refactor Index._validate_names(),2018-01-15 14:12:36,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'PoppyBagel', 'id': 34628304, 'avata..."
2,,[],NONE,- [X] closes #12509,NaT,1,https://api.github.com/repos/pandas-dev/pandas...,2018-01-15 06:00:54,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/19245,...,False,,19245,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,Doc: Adds example of categorical data for effi...,2018-01-15 14:00:41,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'pdpark', 'id': 16848166, 'avatar_ur..."
3,,[],CONTRIBUTOR,- [x] closes #19242 \r\n- [x] whatsnew entry\r\n,NaT,1,https://api.github.com/repos/pandas-dev/pandas...,2018-01-15 04:08:46,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/19244,...,False,,19244,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,BUG: unsupported type Interval when writing da...,2018-01-15 13:35:00,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'cbertinato', 'id': 20772838, 'avata..."
4,,[],CONTRIBUTOR,`Block` and `SparseBlock` each have a `reindex...,NaT,0,https://api.github.com/repos/pandas-dev/pandas...,2018-01-15 02:50:40,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/19243,...,False,,19243,,https://api.github.com/repos/pandas-dev/pandas,open,"core.internals checking `hasattr(item, 'reinde...",2018-01-15 02:50:40,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'jbrockmendel', 'id': 8078968, 'avat..."


For more documentation see pandas.read_json [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html#pandas.read_json]
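
Note that columns such as `user` and `milestone` in the output above still hold nested dictionaries. One possible way to flatten them is `pandas.json_normalize`, sketched here on two hand-made records. This assumes pandas 1.0 or later, and the records are a cut-down illustration, not the full file.

```python
import pandas as pd

# Hypothetical records with a nested "user" object, shaped like the data above.
records = [
    {"number": 19247, "state": "open",
     "user": {"login": "jreback", "id": 953992}},
    {"number": 19246, "state": "open",
     "user": {"login": "PoppyBagel", "id": 34628304}},
]

# Nested keys are expanded into dotted column names (user.login, user.id).
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```

With real data the same call can be applied to the list returned by `json.load`, since `json_normalize` works on dictionaries rather than on a file path.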


## 7 Import data in ZIP format

In [5]:
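#Example1:
import zipfile  # standard library module for reading ZIP archives

path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\FourFiles.zip"

# Open the archive and read one of its members straight into a DataFrame
zf = zipfile.ZipFile(path)
df = pd.read_csv(zf.open('vehicles.csv'))
df.head(5)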


Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,cityA08U,cityCD,cityE,...,mfrCode,c240Dscr,charge240b,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
0,15.695714,0.0,0.0,0.0,19,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
1,29.964545,0.0,0.0,0.0,9,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
2,12.207778,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
3,29.964545,0.0,0.0,0.0,10,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
4,17.347895,0.0,0.0,0.0,17,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


In [13]:
#Example2: Get namelist of files in FourFiles.zip
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\FourFiles.zip"

# The context manager closes the archive automatically
with zipfile.ZipFile(path) as stories_zip:
    print(stories_zip.namelist())

['vehicles.csv', 'Sample_Superstore_Sales.xlsx', 'adult.data.xlsx', 'adult.data.txt']


In [14]:
#Import file 'Sample_Superstore_Sales.xlsx' which is in FourFiles.zip
zf = zipfile.ZipFile(path)
df = pd.read_excel(zf.open('Sample_Superstore_Sales.xlsx'))
df.head(5)

Unnamed: 0,Category,City,Container,Customer,Customer_Segment,Customer_Zip_Code,Last_N_days,Order_Date,Order_ID,Order_Priority,...,Number_of_Records,Order_Quantity,Product_Base_Margin,Profit,Profit_Ratio,Row_ID,Sales,Shipping_Cost,Time_to_Ship,Unit_Price
0,Technology,Louisville,Small Pack,Sanjit Jacobs,Consumer,80027,1778,2011-09-07,52128,Not Specified,...,1,11,0.61,-28,-2.0459,7309,14,3,1,1
1,Technology,Danville,Small Pack,Katherine Hughes,Consumer,24541,1204,2013-04-03,59680,Critical,...,1,15,0.61,-38,-2.1199,8347,18,3,1,1
2,Office Supplies,Thornton,Wrap Bag,Christopher Martinez,Small Business,80229,1871,2011-06-06,31204,Low,...,1,8,0.38,-1,-0.0851,4380,10,1,0,1
3,Office Supplies,Casas Adobes,Wrap Bag,Quincy Jones,Corporate,85704,1284,2013-01-13,55299,Low,...,1,20,0.38,-3,-0.1194,7717,22,1,0,1
4,Office Supplies,Eagle,Wrap Bag,Darren Powers,Corporate,83616,2436,2009-11-18,38529,Low,...,1,38,0.38,0,-0.0109,5421,45,1,0,1


For more documentation see 'Work with ZIP archives' [https://docs.python.org/3/library/zipfile.html]
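
The steps used in this section (list the members, pick one, read it) can also be tried without `FourFiles.zip`. The sketch below builds a tiny archive in memory; the member names `cars.csv` and `notes.txt` and their contents are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a small archive in memory so the sketch runs without any file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("cars.csv", "make,city08\nAlfa Romeo,19\nFerrari,9\n")
    zf.writestr("notes.txt", "not a table\n")

# Reopen the archive for reading, list its members and load the CSV one.
buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    csv_members = [n for n in names if n.endswith(".csv")]
    with zf.open(csv_members[0]) as member:
        df_cars = pd.read_csv(member)

print(names)
print(df_cars)
```

Using `with` blocks for both the archive and the member means everything is closed even if the read fails.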

## 8 Import data in SAS format [sas7bdat]

In [20]:
#Example:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\metrics.sas7bdat"

df = pd.read_sas(path)
df.head(5)

Unnamed: 0,SALARY,GPA,METRICS,FEMALE
0,29555.0,3.77,2.316429e-317,1.0
1,27958.0,3.4,2.197677e-316,2.031097e-316
2,27230.0,2.74,2.197719e-316,2.197583e-316
3,31070.0,3.88,2.435467e-317,1.0
4,27577.0,2.7,2.354104e-317,1.0


For more documentation see pandas.read_sas [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html#pandas.read_sas]


## 9 Import data in STATA format

In [25]:
#Example1:
path = "C:\\Users\\Alvaro\\Documents\\Python_Projects\\Pandas\\Pandas-Project2\\olympics.dta"

df = pd.read_stata(path)
df.head(5)

Unnamed: 0,country,year,gdp,pop,gold,silver,bronze,medaltot,host,planned,soviet
0,1.0,80.0,26600000000.0,16000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,84.0,29900000000.0,17600000.0,,,,,0.0,0.0,0.0
2,2.0,80.0,2450000000.0,2671000.0,,,,,0.0,1.0,0.0
3,2.0,84.0,2660000000.0,2897000.0,,,,,0.0,1.0,0.0
4,2.0,88.0,2800000000.0,3138000.0,,,,,0.0,1.0,0.0


In [26]:
#Example2:
path = "http://www.principlesofeconometrics.com/stata/consumption.dta"

df = pd.read_stata(path)
df.head(5)

Unnamed: 0,inc,cons,dur
0,8369.0,7537.0,428.0
1,8436.0,7651.0,434.0
2,8567.0,7655.0,404.0
3,8692.0,7885.0,475.0
4,8775.0,7947.0,491.0


For more documentation see pandas.read_stata [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_stata.html#pandas.read_stata]

## 10 Import data in XML format

When this project was written, the Pandas package did not have a function to import data from XML (pandas.read_xml was only added in pandas 1.3), so we need to use standard packages and do some extra work to convert the data to Pandas DataFrames.

I will approach this item with two examples. The first uses xml.etree.ElementTree from the Python standard library and the second uses the third-party lxml library.

Both procedures require knowing the number and names of the fields contained in the XML file to be imported. In addition, the examples discussed here are relatively simple and do not cover reading XML directly from the web.
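
For flat XML, newer pandas releases can do the conversion directly. The sketch below assumes pandas 1.3 or later, where `pandas.read_xml` is available. The inline XML and the name `df_contacts` are invented for illustration, and `parser="etree"` is chosen so that lxml does not need to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical XML: one <user> element per row, fields as attributes or children.
xml_text = """<contacts>
  <user name="gokhan"><email>gokhan@gmail.com</email><phone>555-1234</phone></user>
  <user name="mike"><email>mike@gmail.com</email></user>
</contacts>"""

# By default every child of the root element is treated as one row.
df_contacts = pd.read_xml(StringIO(xml_text), parser="etree")
print(df_contacts)
```

Deeply nested elements are not flattened by this call, so a manual approach is still useful for files with nested elements.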

In [60]:
# Example1: Use xml.etree.ElementTree library
import xml.etree.ElementTree as et

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None
 
 
def main():
    parsed_xml = et.parse("mydata1.xml")
    dfcols = ['name', 'email', 'phone', 'street']
    rows = []
 
    for node in parsed_xml.getroot():
        name = node.attrib.get('name')
        email = node.find('email')
        phone = node.find('phone')
        street = node.find('address/street')
 
        # Collect the rows in a list; DataFrame.append was removed in pandas 2.0
        rows.append([name, getvalueofnode(email), getvalueofnode(phone),
                     getvalueofnode(street)])
 
    df_xml = pd.DataFrame(rows, columns=dfcols)
    print("'mydata1.xml' imported as pandas dataframe\n\n", df_xml)

main()

'mydata1.xml' imported as pandas dataframe

      name             email     phone           street
0  gokhan  gokhan@gmail.com  555-1234  Michigan Avenue
1    mike    mike@gmail.com      None             None
2    john    john@gmail.com  555-4567             None
3   david              None  555-6472     Fifth Avenue
4  Robert  rober1@gmail.com  555-2431       Tawin Road
5   Rayen   rayen@yahoo.com  555-9299    Spring Street


In [37]:
# Example2: Use function 'objectify' from lxml library
from lxml import objectify

path = 'mydata2.xml'
cols = ['ID', 'String', 'Description', 'Type', 'Comment', 'Link1', 'Link2']

with open(path, 'rb') as f:
    root = objectify.parse(f).getroot()

rows = []
# Read only the first three records of the file
for record in root.getchildren()[:3]:
    fields = record.getchildren()
    # Only the first six fields are read, so 'Link2' stays empty
    rows.append(dict(zip(cols, [field.text for field in fields[:6]])))

# Build the DataFrame in one step; DataFrame.append was removed in pandas 2.0
df2 = pd.DataFrame(rows, columns=cols)
df2

Unnamed: 0,ID,String,Description,Type,Comment,Link1,Link2
0,id_a_f_3,!Susie (http://www.sync2it.com/susie),Sync2It bookmark management & clustering engine,C R,,http://www.sync2it.com,
1,id_a_f_6,<a href='http://www.unchaos.com/'> UnChaos </a...,UnCHAOS search robot,R,Site is dead,http://www.unchaos.com/,
2,id_a_f_7,<a href='http://www.unchaos.com/'> UnChaos Bot...,UnCHAOS search robot,R,Site is dead,http://www.unchaos.com/,

