# Reading Data In Python
Tabular data files may be formatted in many ways including:
* Free formatted – separated by spaces
* Delimited – separated by other characters
* Quoted strings – strings contain the delimiter
* Aligned – each variable always in same location
* Multiple lines – each case spans several lines
* Structured – e.g. "marked up" as with XML
* Unstructured – e.g. an interview transcript
* Mixed – any combination of the above

Data are arranged in tables because the tools we typically use require tables.

The Python package that is commonly used for tabular data is **Pandas**. The Pandas DataFrame is the table structure we will use. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html




## CSV Files

The first file we'll read is a Comma Separated Values (**CSV**) file. CSV files are becoming a very common way of formatting raw data. See, for example, https://en.wikipedia.org/wiki/Comma-separated_values 

We'll use the Pandas **read_csv** function to read the file; see: 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html 

The data file looks like the following:

`Name,Species,Gender,Weight,Fur,Temper,Bites
102,2,BOY,8,M,2,9
103,1,GIRL,31,11,3,14
104,1,BOY,26,12,1,3
105,2,BOY,14,9,3,15
106,1,GIRL,64,7,3,16
107,2,GIRL,15,3,M,10
108,1,GIRL,9,17,2,11
109,1,BOY,38,4,1,2
110,2,GIRL,12,14,2,12
111,1,BOY,41,2,3,17
112,1,BOY,52,10,1,4
113,2,GIRL,9,10,1,3
`

### Setup
First import the packages we need and define paths to folders we'll use. It is good practice to put this stuff at the top of a project. Then, if you use a different folder you don't need to look through the whole script to make changes.

In [17]:
# import the pandas package with the local name "pd"
import pandas as pd

# the numpy package for datatypes
import numpy as np

# capture the path to the data directory
dataDirectory = r'/Users/cagilalbayrak/Downloads/792 Managing Research Data/data/reading1' + '/'
# the 'r' before the quotes tells Python to treat this string as a RAW string, 
# meaning that it is read literally (i.e. a backslash is not interpreted as an escape character)
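
String concatenation works, but the path can also be assembled with the standard library's pathlib module, which inserts the separators itself. The following is a minimal sketch of that alternative, not what the rest of this notebook uses. The folder `data/reading1` under the home directory and the name `dataDir` are placeholders.

```python
from pathlib import Path

# hypothetical location; replace with the folder that holds your data files
dataDir = Path.home() / "data" / "reading1"

# the / operator joins path components with the separator of the operating system
csvPath = dataDir / "DOGGYDATA.csv"

print(csvPath.name)    # the file name component
print(csvPath.suffix)  # the extension
```

A Path object can be passed directly to pd.read_csv in place of a string.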


### The Pandas read_csv method
The Pandas read_csv method can import a CSV file into a Pandas DataFrame.

In [18]:
# import the csv
dogsDf = pd.read_csv(dataDirectory+"DOGGYDATA.csv")

In [19]:
# display the DataFrame
dogsDf

Unnamed: 0,Name,Species,Gender,Weight,Fur,Temper,Bites
0,102,2,BOY,8,M,2,9
1,103,1,GIRL,31,11,3,14
2,104,1,BOY,26,12,1,3
3,105,2,BOY,14,9,3,15
4,106,1,GIRL,64,7,3,16
5,107,2,GIRL,15,3,M,10
6,108,1,GIRL,9,17,2,11
7,109,1,BOY,38,4,1,2
8,110,2,GIRL,12,14,2,12
9,111,1,BOY,41,2,3,17
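
To follow along without the data file, the same call can be tried on a string. This is only a sketch: the text below copies the header and the first three records of the listing above, and the names `csvText` and `sampleDf` are made up for the illustration.

```python
import io
import pandas as pd

# the header and first three records of the file shown above, held in a string
csvText = """Name,Species,Gender,Weight,Fur,Temper,Bites
102,2,BOY,8,M,2,9
103,1,GIRL,31,11,3,14
104,1,BOY,26,12,1,3
"""

# io.StringIO makes the string behave like an open file
sampleDf = pd.read_csv(io.StringIO(csvText))

print(sampleDf.shape)           # (number of rows, number of columns)
print(list(sampleDf.columns))   # the column names taken from the header line
```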


In [10]:
print("The type of dogsDf is: ", type(dogsDf))


### Information about the DataFrame
The info() method prints basic information about the DataFrame.

In [None]:
dogsDf.info()

### Let's get descriptive statistics for the numeric variables. 
What happened to the Fur, Gender, and Temper columns? They are not numeric.

In [None]:
dogsDf.describe()

### Selecting a Series from a DataFrame
There are several methods of subsetting DataFrames. In the example below the Gender Series is extracted just like a value from a dictionary.

In [None]:
GenderSeries = dogsDf['Gender']
GenderSeries

In [None]:
type(GenderSeries)

### Using dot notation to select a column

In [None]:
dogsDf.Fur

### Value counts for categorical variables 
Text or numeric variables may be categorical - having a specific set of values. Counts of each value can be obtained with the value_counts method of a Series.

In [None]:
# get the value counts for Gender
dogsDf['Gender'].value_counts()

### Numeric columns can be tabulated too

In [None]:
# get the value counts for a numeric categorical column
dogsDf['Species'].value_counts()

### Tabulating a non-categorical numeric column may not be helpful
It might still be useful for finding duplicate values where none should exist, although with weight, duplicates are possible.

In [None]:
# get the value counts for a numeric column
dogsDf['Weight'].value_counts()

### Notice that value_counts() returns a Series

In [None]:
speciesTab = dogsDf['Species'].value_counts()
type(speciesTab)
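
Because the result is a Series, individual counts can be looked up by value. A small self-contained sketch, using a made-up Series rather than the data file:

```python
import pandas as pd

# a made-up categorical column
gender = pd.Series(['BOY', 'GIRL', 'BOY', 'BOY', 'GIRL'])
genderTab = gender.value_counts()

# the distinct values become the index and the counts become the values
print(genderTab['BOY'])
print(genderTab['GIRL'])

# the counts add up to the number of non-missing values
print(genderTab.sum())
```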

### Inferred Datatypes
Because of the M in row 1, Fur is read as type "object", the most general datatype. Each value in the column, though, may have a more specific datatype, in this case a string.
Trying to compute the mean of a set of strings doesn't work so well.
Can you figure out why the funny value 'M11129731741421010' appears in the error message?


In [None]:
print('the value of Fur in row 1 is: ',
      dogsDf.Fur[0],
        'of type: ', type(dogsDf.Fur[0]))
print('the value of Fur in row 2 is: ',
      dogsDf.Fur[1],
        'of type: ', type(dogsDf.Fur[1]))
print('The mean of Fur is: ', dogsDf.Fur.mean())
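
A hint for the question above, as a minimal sketch: a mean starts with a sum, and "adding" strings joins them end to end. The three-value Series below is made up to imitate the start of the Fur column. The exact exception type and message may differ between Pandas versions.

```python
import pandas as pd

# a short object column like Fur: every value is a string
fur = pd.Series(['M', '11', '12'], dtype=object)

# summing strings concatenates them
total = fur.sum()
print(total)

# the mean cannot be computed from the concatenated string
try:
    fur.mean()
except (TypeError, ValueError) as err:
    print('mean failed:', err)
```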

### The importance of Metadata

In order to really understand these data we need some data about the data - **metadata**. 
The data come with a codebook that describes the data and the variables in the dataset. The documentation for the Fur variable is:

    Fur - Fur Length
    Type	Numeric (Integer)
    Measurement Unit	cm
    Numeric Details	Decimals: 0
    Description	Length of fur in cm

    Missing Value
    M


### Getting explicit - specifying a datatype.
The default for read_csv is to make any column that has a text value have the datatype object. We need to make this column numeric and read the "M" as a missing value. The read_csv method has options that can do this.

First define the Fur and Temper columns as floating point numbers.

In [None]:
# import the csv with Fur and Temper as numeric
# (this raises an error because "M" cannot be converted to a number)
dogsDf = pd.read_csv(dataDirectory+"DOGGYDATA.csv", 
                     dtype={'Fur':np.float64, 'Temper':np.float64})

### Missing Data
The error message above says that an "M" can't be read as a number. We need to describe the string "M" as a **sentinel value** that stands for missing. Sentinel values are values in the data that are not part of the **substantive value domain**, the possible values for our measurement.
In Pandas, missing values are represented by NaN ("Not a Number"); see 
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html 


In [None]:
# Oops, we need to tell Python what an "M" means
dogsDf = pd.read_csv(dataDirectory+"DOGGYDATA.csv",  
                        dtype={'Fur':np.float64, 
                               'Temper':np.float64}, 
                        na_values=['M'])

print('the value of Fur in row 1 is: ',
      dogsDf.Fur[0],
        'of type: ', type(dogsDf.Fur[0]))
dogsDf
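
The effect of na_values can also be checked on a few records held in a string. This is a sketch with a made-up three-column excerpt; `csvText` and `demoDf` are names chosen for the illustration.

```python
import io
import numpy as np
import pandas as pd

# a made-up excerpt with the sentinel "M" in two numeric columns
csvText = """Name,Fur,Temper
102,M,2
103,11,3
107,3,M
"""

demoDf = pd.read_csv(io.StringIO(csvText),
                     dtype={'Fur': np.float64, 'Temper': np.float64},
                     na_values=['M'])

# both columns are now floating point and each "M" has become NaN
print(demoDf.dtypes)
print(demoDf['Fur'].isna().sum())
```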

We can see the datatypes of the columns in dogsDf by looking at the "dtypes" attribute, or the "info()" method of the DataFrame.
The default for the columns that had no decimal places is integer. We forced the Fur and Temper columns to be floating point values.

In [None]:
dogsDf.dtypes

In [None]:
dogsDf.info()

### How are missing values managed?
By default missing values are excluded. Note that there are 12 rows in the table, but these counts add up to 11.

In [None]:
dogsDf['Temper'].value_counts()

The missing values can be counted by specifying dropna=False.

In [None]:
dogsDf['Temper'].value_counts(dropna=False)

CSV files are a special case of **Delimited Files**. To read other kinds of delimited files, use pd.read_table and specify the delimiter with the sep (or delimiter) parameter.
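
As a minimal sketch, here is a made-up excerpt that uses the vertical bar as its delimiter. The names `pipeText` and `pipeDf` are chosen for the illustration.

```python
import io
import pandas as pd

# a made-up excerpt delimited by the vertical bar
pipeText = """Name|Species|Gender
102|2|BOY
103|1|GIRL
"""

# sep names the delimiter; read_table assumes a tab when it is omitted
pipeDf = pd.read_table(io.StringIO(pipeText), sep='|')
print(pipeDf)
```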

## Fixed Column Layouts
The file below is arranged in fixed columns. It could be read as a space delimited file, but one could also specify the character positions within each row where each DataFrame column is found. The first variable, for example, is found in positions 0 to 2 of each record.

Note that there are no column names in this file. They will need to be specified in the Python code that reads the file. The method used to read a file like this is pd.read_fwf()  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html

`
102 2  BOY   08   M      2   09
103 1  GIRL  31   11     3   14
104 1  BOY   26   12     1   03
105 2  BOY   14   09     3   15
106 1  GIRL  64   07     3   16
107 2  GIRL  15   03     M   10
108 1  GIRL  09   17     2   11
109 1  BOY   38   04     1   02
110 2  GIRL  12   14     2   12
111 1  BOY   41   02     3   17
112 1  BOY   52   10     1   04
113 2  GIRL  09   10     1   03
`

### Inferring the Columns
pd.read_fwf() can infer the columns by looking at the first few rows (defaults to 100 rows). 
We still have to tell it that there is no header row and what the column names should be.

In [None]:

dogsDf2 = pd.read_fwf(dataDirectory+"column.dat", 
                      header=None, 
                      names=['Name', 'Species', 'Gender', 
                             'Weight', 'Fur', 'Temper', 'Bites'])
dogsDf2

### Being Specific About the Columns
Sometimes it is necessary to be explicit about where the dataframe columns can be found in the record.
**IMPORTANT:** the tuples describe half-open intervals, i.e. (0,3) means starting at 0 up to but not including 3. In other words, positions 0, 1, and 2.
Note also that the names and na_values parameters are not described in the documentation for pd.read_fwf. Instead they are passed on as keyword arguments to the pandas TextFileReader. See the "online docs for IO Tools" referenced in the read_fwf() documentation. https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html 
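
Python string slices use the same half-open convention, so the positions can be checked by hand. The sketch below slices the first record of the file shown above. The names `record` and `fields` are made up for the illustration.

```python
# the first record of the fixed column file
record = "102 2  BOY   08   M      2   09"

# (start, stop) pairs: start is included, stop is not
positions = [(0, 3), (4, 5), (7, 11), (13, 15), (18, 20), (25, 26), (29, 31)]

# slicing with record[start:stop] follows the same half-open rule
fields = [record[start:stop] for start, stop in positions]
print(fields)
```

Some of the slices keep trailing blanks because the field is wider than the value it holds.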

Note that the following code relies on the "positions" list and the "namesList" list being in the same order. If the two orders are different, the column names will be wrong.

In [None]:
positions = [(0,3), 
             (4,5), 
             (7,11), 
             (13,15), 
             (18,20), 
             (25,26), 
             (29,31)]
namesList=['Name', 
           'Species', 
           'Gender', 
           'Weight', 
           'Fur', 
           'Temper', 
           'Bites']
# using default datatypes
dogsDf2 = pd.read_fwf(dataDirectory+"column.dat", 
                      header=None, 
                      colspecs=positions,
                      names=namesList,
                     na_values=['M'])

print("the first time datatype of the Bites column is: ", dogsDf2.Bites.dtype )

# using explicit datatypes
dogsDf2 = pd.read_fwf(dataDirectory+"column.dat", 
                      header=None, 
                      colspecs=positions,
                      names=namesList,
                     na_values=['M'],
                     dtype={"Name":str, "Bites":float})
print("the second time datatype of the Bites column is: ", dogsDf2.Bites.dtype )

print("the datatype of the Name column is: ", dogsDf2.Name.dtype )

dogsDf2

In [None]:
dogsDf2.info()

### Inference Can Fail
In some cases it is not possible for read_fwf to infer the columns. In the following example two abutting DataFrame columns have been added:

* Rating in positions 34 through 48
* Sequence in positions 49 through 50

(Remember that the first position is denoted as "0".)

There are also some numerals in position 51, which are to be ignored.

Here is the data file:

`
102 2  BOY   08   M      2   09   blarney        1  
103 1  GIRL  31   11     3   14   ridiculousstuff2  
104 1  BOY   26   12     1   03   absurd         3  
105 2  BOY   14   09     3   15   silly          4  
106 1  GIRL  64   07     3   16   completely nuts5  
107 2  GIRL  15   03     M   10   ludicrous      6  
108 1  GIRL  09   17     2   11   preposterous   7  
109 1  BOY   38   04     1   02   absurd         8  
110 2  GIRL  12   14     2   12   outlandish     9 5
111 1  BOY   41   02     3   17   outrageous     105
112 1  BOY   52   10     1   04   bizzare        116
113 2  GIRL  09   10     1   03   incredible     127
`


In [None]:
dogsDf3_1 = pd.read_fwf(dataDirectory+"column2.dat", 
                      header=None, 
                      names=['Name', 
                             'Species', 
                             'Gender', 
                             'Weight', 
                             'Fur', 
                             'Temper', 
                             'Bites',
                             'Rating',
                             'Sequence'],
                     na_values=['M'])
dogsDf3_1

### OOPS!
Note that the Rating and Sequence columns were not read correctly, since they were not separated in the raw file. Here is the result when the columns are specified.

In this case the names and columns are placed together in a dict structure. This is a much safer way to specify them, being easier to proofread.  

The method positions.values() returns a dict_values object. A list of the values can be returned by list(positions.values()).

Similarly a list of the keys (the names in this case) can be returned by list(positions.keys()).
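
A minimal sketch with a shortened dict shows that the two lists stay aligned. Python dicts keep their insertion order (guaranteed since Python 3.7), so the n-th name always belongs to the n-th position.

```python
# a shortened version of the dict, for illustration only
positions = {'Name': (0, 3),
             'Species': (4, 5),
             'Gender': (7, 11)}

names = list(positions.keys())
colspecs = list(positions.values())

# keys and values come back in the order in which they were entered
for name, spec in zip(names, colspecs):
    print(name, spec)
```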

In [None]:
positions = {'Name':(0,3), 
             'Species':(4,5), 
             'Gender':(7,11), 
             'Weight':(13,15), 
             'Fur':(18,20), 
             'Temper':(25,26), 
             'Bites':(29,31),
             'Rating':(34,49),
             'Sequence':(49,51)}
dogsDf3 = pd.read_fwf(dataDirectory+"column2.dat", 
                      header=None, 
                      colspecs=list(positions.values()),
                      names=list(positions.keys()),
                     na_values=['M'])
dogsDf3

### Order and Rereading
In the code snippet below the order of the DataFrame columns has been changed. While Name appears first in the raw data file, the DataFrame has Sequence at the beginning.

This is another advantage of using a dict structure for names and positions. The names and positions are reordered together, making it less likely to reorder just one of the two.

The first character of the Gender code has also been reread into a new column Gender1. 

In [None]:
positions = {'Sequence':(49,51),
             'Name':(0,3), 
             'Species':(4,5), 
             'Gender':(7,11), 
             'Gender1':(7,8),
             'Weight':(13,15), 
             'Fur':(18,20), 
             'Temper':(25,26), 
             'Bites':(29,31),
             'Rating':(34,49)
             }
dogsDf3reordered = pd.read_fwf(dataDirectory+"column2.dat", 
                      header=None, 
                      colspecs=list(positions.values()),
                      names=list(positions.keys()),
                     na_values=['M'])
dogsDf3reordered

### Contiguous Columns
Next we create a new example column-oriented dataset with no delimiters at all. Every value in a column has the same width, and the columns are contiguous.
This uses:

* the DataFrame **values** property https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
* the **enumerate()** function to get the index of each value in a row https://docs.python.org/3/library/functions.html#enumerate
* the **str()** function to convert each value to a string https://docs.python.org/3/library/functions.html#func-str
* the **ljust()** or **rjust()** method of the string to fix the width of the string. Strings are left justified and everything else is right justified. https://docs.python.org/3/library/stdtypes.html#str

Numeric missing values (NaN) are written out as "M", left justified to the column width.
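
The padding rules can be tried on single values first. A small sketch, with widths taken from the Gender, Weight, and Fur columns:

```python
# a string is left justified and padded with blanks to width 4
genderField = 'BOY'.ljust(4)

# a number is converted to a string, right justified, and padded with zeros to width 2
weightField = str(8).rjust(2, '0')

# a missing value is written as "M", left justified to width 2
furField = 'M'.ljust(2)

# repr() makes the padding blanks visible
print(repr(genderField), repr(weightField), repr(furField))
```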

### Here is dogsDf3.values

In [None]:
dogsDf3.values

In [None]:
colWidth =  {'Name':3, 
             'Species':1, 
             'Gender':4, 
             'Weight':2, 
             'Fur':2, 
             'Temper':1, 
             'Bites':2,
             'Rating':15,
             'Sequence':2}
# the column widths, in the same order as the DataFrame columns
widths = list(colWidth.values())
for valueList in dogsDf3.values:
    rowString = ''
    for ixValue, value in enumerate(valueList):
        width = widths[ixValue]
        if isinstance(value, str):
            # strings are left justified and padded with blanks
            valueString = value.ljust(width)
        elif str(value) == "nan":
            # missing values are written as "M", left justified
            valueString = "M".ljust(width, " ")
        else:
            # numbers are right justified and padded with zeros
            valueString = str(int(value)).rjust(width, "0")
        rowString = rowString + valueString
    print(rowString)

### Metadata are Essential
Files like this were very common years ago. Storage space was expensive, so data files were usually made as compact as possible.
You might find data like this if you needed to read, for example, census files from the 1980s.

Note that a file like this is not useful without **metadata** that describes how to identify the variables in each row.

We could read such a file as follows:

In [None]:
colWidth =  {'Name':3, 
             'Species':1, 
             'Gender':4, 
             'Weight':2, 
             'Fur':2, 
             'Temper':1, 
             'Bites':2,
             'Rating':15,
             'Sequence':2}
dogsDf4 = pd.read_fwf(dataDirectory+"noDelimiters.dat", 
                      header=None, 
                      widths=list(colWidth.values()),
                      names=list(colWidth.keys()),
                      na_values=['M'])
dogsDf4
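
The same kind of call can be tried without the file. The sketch below reads two made-up contiguous records that cover only the first five columns, with missing values marked by "M". The names `colWidthDemo`, `lines`, and `demoDf4` are chosen for the illustration.

```python
import io
import pandas as pd

# widths of the first five columns only
colWidthDemo = {'Name': 3, 'Species': 1, 'Gender': 4, 'Weight': 2, 'Fur': 2}

# two contiguous records; every field is padded to its column width
lines = ["1022BOY 08M ",
         "1031GIRL3111"]

demoDf4 = pd.read_fwf(io.StringIO("\n".join(lines)),
                      header=None,
                      widths=list(colWidthDemo.values()),
                      names=list(colWidthDemo.keys()),
                      na_values=['M'])
print(demoDf4)
```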

### Comparisons
The following compares cells in the two tables. The iloc DataFrame attribute is used to locate cells by numeric indices.

In [None]:
print(  dogsDf3.iloc[1,7]  )
dogsDf3.iloc[1,7] == dogsDf4.iloc[1,7]

### Folded input lines
Very commonly in the past, due to restrictions on how many characters an input line might have, dataset records were split up into multiple lines in the input file.

An example might be the following file, where each row of the dataset is folded into two lines in the input file.
In this example the fields are located as follows:

* id in positions 0-4 of the first line
* age in positions 6-7 of the first line
* gender in position 9 of the first line
* heightInches in positions 2-3 of the second line
* weightPounds in positions 5-7 of the second line

`
Case1 14 M
  65 125
Case2 20 F
  62 115
Case3 17 M
  71 160
Case4 18 M
  76 190
`

The Pandas read functions cannot deal with this layout.
One approach would be to unfold the lines - to read the file with one string per line and then concatenate the pairs of lines that go together.

The readline() method reads one line from the file and then positions the file object to read the next line.

In [None]:
# read this file and join the first two lines
foldedFile=open(dataDirectory+"FoldedLines.dat")

longLine = foldedFile.readline()[:-1] + foldedFile.readline()

foldedFile.close()
longLine
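
The behaviour of readline() can be observed without the file by wrapping the text in io.StringIO, which offers the same method. A sketch, using the first four lines of the listing above:

```python
import io

# the first two folded records, held in a string
foldedText = "Case1 14 M\n  65 125\nCase2 20 F\n  62 115\n"
demoFile = io.StringIO(foldedText)

# each call returns the next line, including its trailing newline
first = demoFile.readline()
second = demoFile.readline()
third = demoFile.readline()

# dropping the newline of the first line joins the pair into one record
joined = first[:-1] + second
print(repr(joined))
```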

### The join method

In [None]:
# iterate through a string and insert a comma between each character
",".join("abc")

### Using the file iterator
The file object is an iterator over its lines, so it can be read with a for loop. While reading the file, pairs of lines can be concatenated together.

In [None]:
# read this file
foldedFile=open(dataDirectory+"FoldedLines.dat")
# the list that will contain the concatenated lines
newStructure = []

for ixLine,line in enumerate(foldedFile):    
    if ixLine % 2 == 0:
        # the first, third ... line
        # delete the trailing newline
        firstPart = line[:-1]
    else:
        # the second, fourth ... line
        fullLine = firstPart + line[:-1]
        newStructure.append(fullLine)
# close the file once all of the lines have been read
foldedFile.close()
newStructure

### Using a While loop
Another alternative is to use a *while* loop to explicitly loop through the file.

In [None]:
# read this file
foldedFile=open(dataDirectory+"FoldedLines.dat")
# the list that will contain the concatenated lines
newStructure = []
#a list of the lines in one new line
newLineList=[]
# a counter for lines within each new longer line
# when we start on a new long line this becomes 0, the list beginning
lineIndex=-1
while True:
    # get the next line; readline() returns '' only at the end of the file
    line = foldedFile.readline()
    # stop the loop if we hit the end of the file
    if line == '':
        break
    # remove the trailing newline
    line = line.rstrip('\n')
    # add it to the list
    newLineList = newLineList + [line]
    lineIndex = lineIndex + 1        
    # when newLineList is full, concatenate the lines in it 
    # and add them to the new structure
    if lineIndex == 1:
        combinedLine = "".join(newLineList)
        newStructure += [combinedLine]
        lineIndex = -1
        newLineList=[]

    
foldedFile.close()
newStructure
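
The same idea can be wrapped in a small function that handles any number of lines per record. This is a sketch only. The function `unfoldLines` and its parameter `linesPerRecord` are hypothetical names, not part of Pandas or of the notebook above.

```python
def unfoldLines(lines, linesPerRecord=2):
    """Concatenate every linesPerRecord consecutive lines into one string."""
    # remove the trailing newline from every line
    stripped = [line.rstrip('\n') for line in lines]
    # step through the list one record at a time and join the lines of each record
    return [''.join(stripped[i:i + linesPerRecord])
            for i in range(0, len(stripped), linesPerRecord)]

# two folded records, as they would be returned by iterating over the file
sample = ["Case1 14 M\n", "  65 125\n", "Case2 20 F\n", "  62 115\n"]
unfolded = unfoldLines(sample)
print(unfolded)
```

An open file object can be passed in place of the list, since it also yields one line at a time.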

### Make one string from the list
Separate each line with a newline character.

In [None]:
bigString = '\n'.join(newStructure)
bigString

### Use the StringIO class to read the string as if it were a file

In [None]:
import io
bigString = '\n'.join(newStructure)
pd.read_fwf(io.StringIO(bigString), 
                      header=None, 
                      names=['id', 'age', 'gender', 'heightInches','weightPounds'])
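
The call above lets read_fwf infer the columns. When inference is not reliable, the positions can be given explicitly. In an unfolded record the fields of the second line are shifted by the length of the first line, which is 10 characters here. The sketch below works through that arithmetic on two unfolded records. The names `firstLineWidth`, `unfoldedText`, and `peopleDf` are made up for the illustration.

```python
import io
import pandas as pd

# "Case1 14 M" occupies positions 0-9, so the second line starts at position 10
firstLineWidth = 10

colspecs = [(0, 5),                                    # id: positions 0-4
            (6, 8),                                    # age: positions 6-7
            (9, 10),                                   # gender: position 9
            (firstLineWidth + 2, firstLineWidth + 4),  # heightInches: 2-3 of line two
            (firstLineWidth + 5, firstLineWidth + 8)]  # weightPounds: 5-7 of line two

# two unfolded records
unfoldedText = "Case1 14 M  65 125\nCase2 20 F  62 115"

peopleDf = pd.read_fwf(io.StringIO(unfoldedText),
                       header=None,
                       colspecs=colspecs,
                       names=['id', 'age', 'gender', 'heightInches', 'weightPounds'])
print(peopleDf)
```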