 #  IO Tools
 
 https://www.tutorialspoint.com/python_pandas/python_pandas_io_tool.htm
The Pandas I/O API is a set of top-level reader functions, accessed like pd.read_csv(), that generally return a Pandas object.

The two workhorse functions for reading text files (flat files) are read_csv() and read_table(). Both use the same parsing code to convert tabular data into a DataFrame object; they differ only in the default separator (a comma for read_csv, a tab for read_table) −

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer',
names=None, index_col=None, usecols=None, ...)

pandas.read_table(filepath_or_buffer, sep='\t', delimiter=None, header='infer',
names=None, index_col=None, usecols=None, ...)

Here is what the CSV file data looks like −

S.No,Name,Age,City,Salary

1,Tom,28,Toronto,20000

2,Lee,32,HongKong,3000

3,Steven,43,Bay Area,8300

4,Ram,38,Hyderabad,3900

Save this data as temp.csv and conduct operations on it.
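
As a minimal sketch, the same sample can also be parsed without writing a file by wrapping the text in io.StringIO, which read_csv accepts as a buffer. The names csv_text and sample are illustrative only and do not appear elsewhere in this notebook.

```python
import io
import pandas as pd

# Sample rows from above, held in memory instead of in temp.csv.
csv_text = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
"""

# read_csv accepts a path or any file-like buffer.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
print(sample.shape)  # (number of rows, number of columns)
```

Reading from a buffer keeps small experiments reproducible, since the result does not depend on whatever temp.csv happens to contain.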

In [20]:
import pandas as pd


In [21]:
df=pd.read_csv("temp.csv")
print (df)

       total_bill    tip         sex    smoker    day      time    size
0           16.99   1.01     Female         No    Sun    Dinner       2
1           10.34   1.66        Male        No    Sun    Dinner       3
2           21.01   3.50        Male        No    Sun    Dinner       3
3           23.68   3.31        Male        No    Sun    Dinner       2
4           24.59   3.61      Female        No    Sun    Dinner       4


In [22]:
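# Custom index
# index_col specifies a column in the CSV file to use as the row index.
# The call below is left commented out: it fails with an invalid-index error
# because the temp.csv used for these outputs has no 'S.No' column.
#df = pd.read_csv("temp.csv", index_col=['S.No'])
#print(df)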
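
A hedged sketch of what index_col does when the column is present, using the sample rows from the start of this section held in an in-memory buffer (csv_text and indexed are illustrative names):

```python
import io
import pandas as pd

csv_text = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
"""

# Use the S.No column as the row index instead of the default 0..n-1 range.
indexed = pd.read_csv(io.StringIO(csv_text), index_col="S.No")
print(indexed)

# Rows can now be looked up by their S.No label.
print(indexed.loc[2, "Name"])
```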

In [23]:
import numpy as np
# Converters
# The dtype of each column can be passed as a dict.
df = pd.read_csv("temp.csv", dtype={'Salary': np.float64})
print (df.dtypes)

total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object


In [24]:
# By default the dtype of the Salary column is int, but dtype= casts it
# explicitly to float, so its values are displayed as floats.
# Note: the temp.csv loaded here holds the tips data shown above and has no
# Salary column, so the frame below is unchanged.
df

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
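
To see the dtype argument actually change a column, here is a small sketch that applies it to the Salary sample from the start of this section, read from an in-memory buffer (csv_text, default_types and typed are illustrative names):

```python
import io
import numpy as np
import pandas as pd

csv_text = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
"""

# Without dtype, Salary is inferred as an integer column.
default_types = pd.read_csv(io.StringIO(csv_text))
print(default_types.dtypes)

# With dtype, Salary is parsed as float64 while the other columns are inferred.
typed = pd.read_csv(io.StringIO(csv_text), dtype={"Salary": np.float64})
print(typed.dtypes)
```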


In [25]:
# header_names
# Specify the names of the header using the names argument. 

In [26]:
df=pd.read_csv("temp.csv", names=['a', 'b', 'c','d','e'])
df 

                       a       b    c       d     e
total_bill tip       sex  smoker  day    time  size
16.99      1.01   Female      No  Sun  Dinner     2
10.34      1.66     Male      No  Sun  Dinner     3
21.01      3.50     Male      No  Sun  Dinner     3
23.68      3.31     Male      No  Sun  Dinner     2
24.59      3.61   Female      No  Sun  Dinner     4


In [27]:
# Observe that the custom names are applied, but the header row from the file
# has not been eliminated: it now appears as the first data row.
# Pass header=0 to tell pandas that row 0 is the header to be replaced.
# If the header is in a row other than the first, pass that row number to header;
# the preceding rows are then skipped.

df=pd.read_csv("temp.csv",names=['a','b','c','d','e'],header=0)
df

                 a   b    c       d  e
16.99 1.01  Female  No  Sun  Dinner  2
10.34 1.66    Male  No  Sun  Dinner  3
21.01 3.50    Male  No  Sun  Dinner  3
23.68 3.31    Male  No  Sun  Dinner  2
24.59 3.61  Female  No  Sun  Dinner  4
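
The difference between names alone and names combined with header=0 is easier to see when the number of names matches the number of columns. A minimal sketch on the five-column sample from the start of this section (csv_text, kept and replaced are illustrative names):

```python
import io
import pandas as pd

csv_text = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
"""
new_names = ["a", "b", "c", "d", "e"]

# names only: the file's own header line is kept as the first data row.
kept = pd.read_csv(io.StringIO(csv_text), names=new_names)
print(kept)

# names plus header=0: the file's header line is consumed and replaced.
replaced = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)
print(replaced)
```

In the outputs above, only five names were supplied for seven columns, which is why the two leftmost columns ended up as the row index.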


In [28]:
# skiprows skips the given number of lines at the top of the file;
# the first remaining line is then used as the header.
df=pd.read_csv("temp.csv", skiprows=2)
df  

   16.99  1.01  Female  No  Sun  Dinner  2
0  10.34  1.66    Male  No  Sun  Dinner  3
1  21.01  3.50    Male  No  Sun  Dinner  3
2  23.68  3.31    Male  No  Sun  Dinner  2
3  24.59  3.61  Female  No  Sun  Dinner  4
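
Because an integer skiprows also swallows the header line, a list of line numbers is often the safer choice when only data rows should be dropped. A short sketch on the sample from the start of this section (csv_text, top_skipped and row_skipped are illustrative names):

```python
import io
import pandas as pd

csv_text = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
"""

# Integer: skip the first two lines, so a data row becomes the header.
top_skipped = pd.read_csv(io.StringIO(csv_text), skiprows=2)
print(top_skipped.columns.tolist())

# List: skip only line 1 (0-indexed), keeping the real header on line 0.
row_skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1])
print(row_skipped)
```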


#  Sparse Data
https://www.tutorialspoint.com/python_pandas/python_pandas_sparse_data.htm
Sparse objects are “compressed”: any data matching a specific value (NaN / missing value by default, though any value can be chosen) is omitted from storage.

A special SparseIndex object tracks where data has been “sparsified”. This will make much more sense in an example. In the pandas version used for this notebook, all of the standard Pandas data structures provide a to_sparse method (it was deprecated in pandas 0.25 and removed in pandas 1.0) −

In [29]:
ts = pd.Series(np.random.randn(10))
print (ts,'\n' )
ts[2:-2] = np.nan
print (ts,'\n' )
sts = ts.to_sparse()
print (sts )

0    0.397758
1   -0.902610
2    0.563284
3    1.657133
4   -0.176145
5    0.881174
6   -0.979461
7    0.779686
8    1.250653
9    0.147026
dtype: float64 

0    0.397758
1   -0.902610
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    1.250653
9    0.147026
dtype: float64 

0    0.397758
1   -0.902610
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    1.250653
9    0.147026
dtype: Sparse[float64, nan]
BlockIndex
Block locations: array([0, 8])
Block lengths: array([2, 2])
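
On pandas 1.0 and later, where to_sparse no longer exists, a similar result can be sketched by casting to a SparseDtype and inspecting the .sparse accessor. The seeded generator and the variable names below are assumptions made so the example is reproducible; they are not part of the original cell.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible.
rng = np.random.default_rng(0)
ts = pd.Series(rng.standard_normal(10))
ts.iloc[2:-2] = np.nan  # blank out the six middle values by position

# Cast to a sparse dtype; NaN is the fill value that is not stored.
sts = ts.astype(pd.SparseDtype("float64", np.nan))

print(sts.dtype)            # sparse dtype of the Series
print(sts.sparse.npoints)   # number of values actually stored
print(sts.sparse.density)   # stored values divided by total length
```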


In [30]:
 
df = pd.DataFrame(np.random.randn(10000, 4))
# .ix is deprecated; .loc does the same label-based slicing here
# (label slices are inclusive, so rows 0..9998 are set to NaN).
df.loc[:9998] = np.nan
sdf = df.to_sparse()


# Only the last row still holds values.
print(df.loc[9999])
print('\n\n', df.loc[9999][1])
print('density = ', sdf.density)

0   -1.859127
1    0.677000
2   -0.510576
3   -0.886306
Name: 9999, dtype: float64


 0.6769995256319982
density =  0.0001
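
For a DataFrame on pandas 1.0 or later, the same density figure can be obtained, as a sketch, through astype and the DataFrame .sparse accessor. The seeded generator and the names dense and sparse_frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dense = pd.DataFrame(rng.standard_normal((10000, 4)))
dense.loc[:9998] = np.nan  # label slice is inclusive: only row 9999 survives

# Store only the non-NaN values of every column.
sparse_frame = dense.astype(pd.SparseDtype("float64", np.nan))

print(sparse_frame.dtypes)
print('density = ', sparse_frame.sparse.density)
```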
  


# Sparse Dtypes

Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. The default fill_value depends on the original dtype −

    float64 − np.nan

    int64 − 0

    bool − False
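
These defaults can be checked directly, assuming a pandas version that provides pd.SparseDtype (0.24 or later). The dictionary name defaults is illustrative.

```python
import pandas as pd

# Build a sparse dtype for each supported dense dtype without giving a
# fill_value, then read back the default that pandas chose.
defaults = {name: pd.SparseDtype(name).fill_value
            for name in ("float64", "int64", "bool")}

for name, fill in defaults.items():
    print(name, '->', fill)
```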
 


In [31]:
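s = pd.Series([1, np.nan, np.nan])
print (s)

# to_sparse() returns a new object and leaves s unchanged; the result is
# discarded here, so the second print shows the same dense float64 Series.
s.to_sparse()
print (s)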

0    1.0
1    NaN
2    NaN
dtype: float64
0    1.0
1    NaN
2    NaN
dtype: float64


# Caveats & Gotchas 
A caveat is a warning, and a gotcha is an unseen problem.

https://www.tutorialspoint.com/python_pandas/python_pandas_caveats_and_gotchas.htm

<b>Using If/Truth Statement with Pandas</b>

Pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the Boolean operations and, or, and not. It is not clear what the result should be: should it be True because the object is not zero-length, or False because it contains False values? Because it is unclear, Pandas raises a ValueError −

In [32]:
#if pd.Series([False, True, False]):
#    print ('I am True')
#ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [33]:
if pd.Series([False, True, False]).any():
    print("I am any")

I am any
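
The following sketch puts the failing case and the explicit alternatives side by side. It catches the ValueError instead of letting it stop the cell; the names flags and message are illustrative.

```python
import pandas as pd

flags = pd.Series([False, True, False])

# Using the Series itself as a condition is ambiguous and raises.
try:
    if flags:
        print("never reached")
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

# State the intended question explicitly instead.
print(flags.any())    # is at least one element True?
print(flags.all())    # are all elements True?
print(flags.empty)    # does the Series have no elements?
```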


In [34]:
 # To evaluate single-element pandas objects in a Boolean context, 
 # use the method .bool() −

print (pd.Series([True]).bool())

True


In [35]:
#Bitwise Boolean

#Operators like == and != are applied element-wise and return a Boolean Series,
#which is almost always what is required anyway.

s = pd.Series(range(5))
print (s )
print (s==4 )

0    0
1    1
2    2
3    3
4    4
dtype: int64
0    False
1    False
2    False
3    False
4     True
dtype: bool
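
Since the keywords and, or and not raise the ValueError described earlier, Boolean Series are combined with the element-wise operators &, | and ~ instead. A small sketch (the names both, either and neither are illustrative); the parentheses are needed because these operators bind more tightly than the comparisons:

```python
import pandas as pd

s = pd.Series(range(5))

both = (s > 1) & (s < 4)       # element-wise AND
either = (s == 0) | (s == 4)   # element-wise OR
neither = ~either              # element-wise NOT

# A Boolean Series can be used directly as a row filter.
print(s[both].tolist())
print(s[either].tolist())
print(s[neither].tolist())
```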


In [36]:
s = pd.Series(list('abc'))
print (s )
s = s.isin(['a', 'c', 'e'])
print (s )

0    a
1    b
2    c
dtype: object
0     True
1    False
2     True
dtype: bool


In [37]:
# Reindexing vs ix Gotcha

# Many users reach for the ix indexer as a concise means
# of selecting data from a Pandas object (.loc gives the same result for labels).
df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three',
'four'],index=list('abcdef'))

print (df,'\n\n')
print (df.ix[['b', 'c', 'e']],'\n\n')
print (df.loc[['b', 'c', 'e']])

        one       two     three      four
a  1.244768 -0.416888 -1.804559 -2.427946
b  0.490420  0.797101 -0.703248 -1.270402
c  0.333451  0.678543 -0.034102 -0.561344
d -0.528927  1.541165  2.343681  1.504311
e -0.809158  0.006845  0.874882 -0.278007
f -1.167694  0.131706  0.077056 -0.977912 


        one       two     three      four
b  0.490420  0.797101 -0.703248 -1.270402
c  0.333451  0.678543 -0.034102 -0.561344
e -0.809158  0.006845  0.874882 -0.278007 


        one       two     three      four
b  0.490420  0.797101 -0.703248 -1.270402
c  0.333451  0.678543 -0.034102 -0.561344
e -0.809158  0.006845  0.874882 -0.278007


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


In [38]:
#This is, of course, completely equivalent in this case to using the 
# reindex method −
print (df.reindex(['b', 'c', 'e']))

        one       two     three      four
b  0.490420  0.797101 -0.703248 -1.270402
c  0.333451  0.678543 -0.034102 -0.561344
e -0.809158  0.006845  0.874882 -0.278007


In [39]:
#Some might conclude that ix and reindex are 100% equivalent based on this.
#This is true except in the case of integer indexing: on a non-integer index,
#ix falls back to positions, while reindex looks only for labels.
#For example, the above operation can alternatively be expressed as −
print (df)
print (df.ix[[1, 2, 4]])
print (df.reindex([1, 2, 4]))

#It is important to remember that reindex is strict label indexing only. 
#This can lead to some potentially surprising results in pathological
#cases where an index contains, say, both integers and strings.

        one       two     three      four
a  1.244768 -0.416888 -1.804559 -2.427946
b  0.490420  0.797101 -0.703248 -1.270402
c  0.333451  0.678543 -0.034102 -0.561344
d -0.528927  1.541165  2.343681  1.504311
e -0.809158  0.006845  0.874882 -0.278007
f -1.167694  0.131706  0.077056 -0.977912
        one       two     three      four
b  0.490420  0.797101 -0.703248 -1.270402
c  0.333451  0.678543 -0.034102 -0.561344
e -0.809158  0.006845  0.874882 -0.278007
   one  two  three  four
1  NaN  NaN    NaN   NaN
2  NaN  NaN    NaN   NaN
4  NaN  NaN    NaN   NaN


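
Because .ix was removed in pandas 1.0, the same comparison can be sketched on current versions with .loc for labels and .iloc for positions. The seeded generator and the names frame, by_label, by_position and reindexed are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((6, 4)),
                     columns=['one', 'two', 'three', 'four'],
                     index=list('abcdef'))

by_label = frame.loc[['b', 'c', 'e']]   # label-based selection
by_position = frame.iloc[[1, 2, 4]]     # position-based selection
reindexed = frame.reindex([1, 2, 4])    # labels 1, 2, 4 do not exist

print(by_label.equals(by_position))     # same rows either way
print(reindexed)                        # all NaN: reindex never uses positions
```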


# Comparison with SQL
https://www.tutorialspoint.com/python_pandas/python_pandas_comparison_with_sql.htm

In [40]:
import pandas as pd
url = 'temp.csv'
tips= pd.read_csv(url, header=0)
print (tips.head())


       total_bill    tip         sex    smoker    day      time    size
0           16.99   1.01     Female         No    Sun    Dinner       2
1           10.34   1.66        Male        No    Sun    Dinner       3
2           21.01   3.50        Male        No    Sun    Dinner       3
3           23.68   3.31        Male        No    Sun    Dinner       2
4           24.59   3.61      Female        No    Sun    Dinner       4


In [41]:
df = pd.read_csv(url, index_col= 0)
df

             tip     sex smoker  day    time  size
total_bill
16.99       1.01  Female     No  Sun  Dinner     2
10.34       1.66    Male     No  Sun  Dinner     3
21.01       3.50    Male     No  Sun  Dinner     3
23.68       3.31    Male     No  Sun  Dinner     2
24.59       3.61  Female     No  Sun  Dinner     4


In [42]:
df = pd.read_csv(url, usecols = [0,1,2])
df

   total_bill   tip     sex
0       16.99  1.01  Female
1       10.34  1.66    Male
2       21.01  3.50    Male
3       23.68  3.31    Male
4       24.59  3.61  Female
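
usecols also accepts column names, which tends to be more robust than positions when the column order of a file may change. A sketch on the Salary sample from the IO Tools section, read from an in-memory buffer (csv_text, by_name and by_position are illustrative names):

```python
import io
import pandas as pd

csv_text = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
"""

# Select columns by name ...
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Salary"])
# ... or by 0-based position; both give the same frame here.
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[1, 4])

print(by_name)
print(by_name.equals(by_position))
```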


In [50]:
#url ='https://raw.githubusercontent.com/datagy/pivot_table_pandas/master/select_columns.csv'
url2 ='temp2.csv'
df = pd.read_csv(url2)
print(df )

selection = df.loc[:2,'Name']
print('\nselection = \n',selection)

      Name  Age Height  Score  Random_A  Random_B  Random_C  Random_D  \
0      Joe   28    5'9     30        73        59         5         4   
1  Melissa   26    5'5     32        30        85        38        32   
2      Nik   31   5'11     34        80        71        59        71   
3   Andrea   33    5'6     38        16        63        86        81   
4     Jane   32    5'8     29        19        40        48         5   
5   gumgum   28    5'9     30        83        59         5         4   

   Random_E  
0        31  
1        80  
2        53  
3        42  
4        68  
5        31  

selection = 
 0        Joe
1    Melissa
2        Nik
Name: Name, dtype: object


# SELECT

In SQL, selection is done using a comma-separated list of columns that you select (or a * to select all columns) −

SELECT Name,  Age, Height,  Score
FROM selection
WHERE  Name   =  'Joe'
LIMIT 5;

With Pandas, column selection is done by passing a list of column names to your DataFrame −

selection[['Name', 'Age', 'Height', 'Score']].head(5)
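
As a rough sketch, the full SQL statement above (column list, WHERE and LIMIT) maps onto a single .loc call followed by head. The small people frame below reuses a few rows from the temp2.csv output shown earlier; the names people and result are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Joe", "Melissa", "Nik", "Andrea", "Jane"],
    "Age": [28, 26, 31, 33, 32],
    "Height": ["5'9", "5'5", "5'11", "5'6", "5'8"],
    "Score": [30, 32, 34, 38, 29],
})

# WHERE Name = 'Joe'  -> Boolean row mask
# SELECT Name, ...    -> list of column names
# LIMIT 5             -> head(5)
result = people.loc[people["Name"] == "Joe",
                    ["Name", "Age", "Height", "Score"]].head(5)
print(result)
```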


In [64]:
df = pd.read_csv(url2)
#pd.read_csv(url,  header =[1, 2]) 
# tips['total_bill', 'tip', 'smoker', 'time']
selection = df.loc[:6,['Name', 'Age', 'Height', 'Score']]
print(selection,' \n\n')
            
print (selection[selection['Name'] ==  'Joe'].head(5))# WHERE  Name   =  'Joe'

      Name  Age Height  Score
0   Andrea   33    5'6     38
1      Joe   28    5'9     30
2  Melissa   26    5'5     32
3      Nik   31   5'11     34
4   Andrea   33    5'6     38
5   Andrea   33    5'6     38
6     Jane   32    5'8     29  


  Name  Age Height  Score
1  Joe   28    5'9     30


# GroupBy

This operation fetches the count of records in each group throughout a dataset. For instance, a query fetching the number of records for each Name −

SELECT Name, count(*)
FROM selection
GROUP BY Name;


In [65]:
print (selection.groupby('Name').size())

Name
Andrea     3
Jane       1
Joe        1
Melissa    1
Nik        1
dtype: int64
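
One detail worth sketching: size() counts rows per group like COUNT(*), while count() counts only non-null values per column like COUNT(column). The frame below reuses the names from the output above, with one Score deliberately set to missing for illustration; the names records, rows_per_name and scores_per_name are assumptions.

```python
import numpy as np
import pandas as pd

records = pd.DataFrame({
    "Name": ["Andrea", "Joe", "Melissa", "Nik", "Andrea", "Andrea", "Jane"],
    "Score": [38, 30, 32, 34, 38, 38, np.nan],  # Jane's score is missing
})

# COUNT(*): number of rows in each group, missing values included.
rows_per_name = records.groupby("Name").size()
print(rows_per_name)

# COUNT(Score): number of non-null Score values in each group.
scores_per_name = records.groupby("Name")["Score"].count()
print(scores_per_name)
```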
