## Importing data

A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. Though it is easy to load basic data structures into Python using built-in tools or those provided by packages like NumPy, it is non-trivial to import structured data well, and to easily convert this input into a robust data structure:

    genes = np.loadtxt("genes.csv", delimiter=",", dtype=[('gene', '|S10'), ('value', '<f4')])

Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.

Let's start with some more bacteria data, stored in csv format.

In [1]:
import pandas as pd
import numpy as np

In [3]:
!head data/microbiome.csv

Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,80


This table can be read into a DataFrame using `read_csv`:

In [4]:
mb = pd.read_csv("data/microbiome.csv")
mb.head(10)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
5,Firmicutes,6,693,50
6,Firmicutes,7,718,717
7,Firmicutes,8,173,33
8,Firmicutes,9,228,80
9,Firmicutes,10,162,3196


In [5]:
mb.shape[1]

4

Notice that `read_csv` automatically considered the first row in the file to be a header row.

We can override the default behavior by customizing some of the arguments, like `header`, `names` or `index_col`.

In [6]:
l = pd.read_csv("data/microbiome.csv", header=None).head()
l

Unnamed: 0,0,1,2,3
0,Taxon,Patient,Tissue,Stool
1,Firmicutes,1,632,305
2,Firmicutes,2,136,4182
3,Firmicutes,3,1174,703
4,Firmicutes,4,408,3946


In [7]:
l.loc[0, :]

0      Taxon
1    Patient
2     Tissue
3      Stool
Name: 0, dtype: object

In [8]:
l.columns = l.loc[0, :]
l.head()

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Taxon,Patient,Tissue,Stool
1,Firmicutes,1,632,305
2,Firmicutes,2,136,4182
3,Firmicutes,3,1174,703
4,Firmicutes,4,408,3946


In [9]:
# drop returns a new DataFrame; l itself is left unchanged
l.drop([0], axis=0)

Unnamed: 0,Taxon,Patient,Tissue,Stool
1,Firmicutes,1,632,305
2,Firmicutes,2,136,4182
3,Firmicutes,3,1174,703
4,Firmicutes,4,408,3946


In [10]:
l

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Taxon,Patient,Tissue,Stool
1,Firmicutes,1,632,305
2,Firmicutes,2,136,4182
3,Firmicutes,3,1174,703
4,Firmicutes,4,408,3946
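
Rather than repairing the header by hand as in the cells above, the column labels can be supplied at read time. The following is a minimal sketch that uses a small in-memory CSV as a stand-in for `data/microbiome.csv`; the `csv_text` sample and the lower-case column names are illustrative assumptions, not part of the original data.

```python
import io
import pandas as pd

# Small in-memory stand-in for data/microbiome.csv (illustrative sample only)
csv_text = """Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
"""

# header=0 marks the first line as the header row; names= then replaces it,
# so the original header is not read in as a data row.
renamed = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    names=["taxon", "patient", "tissue", "stool"],
)

# With header=None the header line becomes data row 0, and the numeric
# columns can no longer be parsed as numbers.
raw = pd.read_csv(io.StringIO(csv_text), header=None)

print(renamed.dtypes)
print(raw.iloc[0].tolist())
```

Passing `names` together with `header=0` keeps the numeric columns numeric, which the manual approach loses because the header text is mixed into the data.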


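`read_csv` is a convenience counterpart to the more general `read_table`, since csv is such a common format; the two share the same parser and differ only in their default separator (a comma versus a tab):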

In [11]:
mb = pd.read_table("data/microbiome.csv", sep=',')
mb.head()

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605


The `sep` argument can be customized as needed to accommodate arbitrary separators. For example, we can use a regular expression to define a variable amount of whitespace, which is unfortunately very common in some data formats (the pattern is written as a raw string so that Python leaves the backslash alone):

    sep=r'\s+'
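
As a small illustration of the whitespace separator, the sketch below parses a column-aligned text table held in memory. The sample text is an assumption made for the example and is not one of the tutorial's data files.

```python
import io
import pandas as pd

# Illustrative whitespace-aligned text (not one of the tutorial's data files)
text = """Taxon        Patient   Tissue  Stool
Firmicutes   1         632     305
Firmicutes   2         136     4182
"""

# A raw string keeps the backslash in "\s" from being treated as a string escape.
ws = pd.read_csv(io.StringIO(text), sep=r"\s+")
print(ws)
```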

For a more useful index, we can specify the first two columns, which together provide a unique index to the data.

In [12]:
mb = pd.read_csv("data/microbiome.csv", index_col=['Taxon','Patient'])
mb.head(35)

Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,80
Firmicutes,10,162,3196


This is called a *hierarchical* index, which we will revisit later.

If we have sections of data that we do not wish to import (for example, known bad data), we can populate the `skiprows` argument:

In [13]:
import random

In [14]:
mb.shape

(75, 2)

In [15]:
# An integer skiprows skips that many lines from the top of the file, header line included
pd.read_csv("data/microbiome.csv", sep=',', skiprows=3).head(75)

Unnamed: 0,Firmicutes,3,1174,703
0,Firmicutes,4,408,3946
1,Firmicutes,5,831,8605
2,Firmicutes,6,693,50
3,Firmicutes,7,718,717
4,Firmicutes,8,173,33
...,...,...,...,...
67,Other,11,203,6
68,Other,12,392,6
69,Other,13,28,25
70,Other,14,12,22


In [16]:
pd.read_csv("data/microbiome.csv", sep=',').head(10)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
5,Firmicutes,6,693,50
6,Firmicutes,7,718,717
7,Firmicutes,8,173,33
8,Firmicutes,9,228,80
9,Firmicutes,10,162,3196


In [18]:
pd.read_csv("data/microbiome.csv", sep=',' , skiprows=[1,2,3,5]).head(75)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,4,408,3946
1,Firmicutes,6,693,50
2,Firmicutes,7,718,717
3,Firmicutes,8,173,33
4,Firmicutes,9,228,80
...,...,...,...,...
66,Other,11,203,6
67,Other,12,392,6
68,Other,13,28,25
69,Other,14,12,22


In [14]:
pd.read_csv("data/microbiome.csv", sep=',' , skiprows=random.sample(range(0, mb.shape[0]), 25)).head()

Unnamed: 0,Firmicutes,1,632,305
0,Firmicutes,2,136,4182
1,Firmicutes,3,1174,703
2,Firmicutes,5,831,8605
3,Firmicutes,8,173,33
4,Firmicutes,9,228,80


In [20]:
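# Caution: this re-reads the file for every line, and x counts file lines (0 is the header), so the header is skipped too
pd.read_csv("data/microbiome.csv", sep=',', skiprows=lambda x: x < 75 and pd.read_csv("data/microbiome.csv").iloc[x, 0].startswith('F'))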

Unnamed: 0,Firmicutes,15,281,2377
0,Proteobacteria,1,1638,3886
1,Proteobacteria,2,2469,1821
2,Proteobacteria,3,839,661
3,Proteobacteria,4,4414,18
4,Proteobacteria,5,12044,83
5,Proteobacteria,6,2310,12
6,Proteobacteria,7,3053,547
7,Proteobacteria,8,395,2174
8,Proteobacteria,9,2651,767
9,Proteobacteria,10,1195,76


In [21]:
# Alternative to previous code

# Read the CSV file once
df = pd.read_csv("data/microbiome.csv", sep=',')

# Filter rows based on the condition
filtered_df = df[~((df.index < 75) & (df.iloc[:, 0].str.startswith('F')))].reset_index()

# Display the filtered DataFrame
filtered_df

Unnamed: 0,index,Taxon,Patient,Tissue,Stool
0,15,Proteobacteria,1,1638,3886
1,16,Proteobacteria,2,2469,1821
2,17,Proteobacteria,3,839,661
3,18,Proteobacteria,4,4414,18
4,19,Proteobacteria,5,12044,83
5,20,Proteobacteria,6,2310,12
6,21,Proteobacteria,7,3053,547
7,22,Proteobacteria,8,395,2174
8,23,Proteobacteria,9,2651,767
9,24,Proteobacteria,10,1195,76
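
A callable passed to `skiprows` receives only the line number, so filtering on row content is usually simpler after loading, as in the previous cell. When the rule depends on position alone, a callable works well. The sketch below, on an illustrative in-memory sample, keeps the header and every second data line.

```python
import io
import pandas as pd

# Illustrative in-memory sample standing in for data/microbiome.csv
csv_text = """Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
"""

# The callable is evaluated on each 0-based file line number; line 0 is the
# header here, so it is kept explicitly. Returning True skips the line.
every_other = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda line_no: line_no > 0 and line_no % 2 == 0,
)
print(every_other["Patient"].tolist())
```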


Conversely, if we only want to import a small number of rows from, say, a very large data file, we can use `nrows`:

In [19]:
pd.read_csv("data/microbiome.csv", nrows=10)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
5,Firmicutes,6,693,50
6,Firmicutes,7,718,717
7,Firmicutes,8,173,33
8,Firmicutes,9,228,80
9,Firmicutes,10,162,3196


Alternatively, if we want to process our data in reasonable chunks, the `chunksize` argument will return an iterable object that can be employed in a data processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each:

In [16]:
data_chunks = pd.read_csv("data/microbiome.csv", chunksize=15)
data_chunks

<pandas.io.parsers.readers.TextFileReader at 0x7250156d53d0>

In [17]:
for chunk in data_chunks:
    print(chunk)

         Taxon  Patient  Tissue  Stool
0   Firmicutes        1     632    305
1   Firmicutes        2     136   4182
2   Firmicutes        3    1174    703
3   Firmicutes        4     408   3946
4   Firmicutes        5     831   8605
5   Firmicutes        6     693     50
6   Firmicutes        7     718    717
7   Firmicutes        8     173     33
8   Firmicutes        9     228     80
9   Firmicutes       10     162   3196
10  Firmicutes       11     372     32
11  Firmicutes       12    4255   4361
12  Firmicutes       13     107   1667
13  Firmicutes       14      96    223
14  Firmicutes       15     281   2377
             Taxon  Patient  Tissue  Stool
15  Proteobacteria        1    1638   3886
16  Proteobacteria        2    2469   1821
17  Proteobacteria        3     839    661
18  Proteobacteria        4    4414     18
19  Proteobacteria        5   12044     83
20  Proteobacteria        6    2310     12
21  Proteobacteria        7    3053    547
22  Proteobacteria        8     

In [20]:
# Re-create the chunked reader before iterating again: the iterator is
# exhausted after one complete pass, so looping over the same object a
# second time would yield no data.

data_chunks = pd.read_csv("data/microbiome.csv", chunksize=15)

mean_tissue = {chunk["Taxon"].iloc[0]:chunk["Tissue"].mean() for chunk in data_chunks}  
mean_tissue

{'Firmicutes': 684.4,
 'Proteobacteria': 2943.0666666666666,
 'Actinobacteria': 449.06666666666666,
 'Bacteroidetes': 599.6666666666666,
 'Other': 198.8}
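
The dictionary comprehension above relies on each chunk holding exactly one phylum. When chunk boundaries do not line up with groups, one option is to accumulate sums and counts across chunks. This is a minimal sketch on an illustrative in-memory sample; the names `totals`, `counts` and `mean_by_taxon` are assumptions made for the example.

```python
import io
import pandas as pd

# Illustrative in-memory sample standing in for data/microbiome.csv
csv_text = """Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,1174,703
Proteobacteria,1,1638,3886
Proteobacteria,2,2469,1821
"""

totals = {}  # Taxon -> running sum of Tissue
counts = {}  # Taxon -> running number of rows

# chunksize=2 deliberately splits the Firmicutes rows across two chunks
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    grouped = chunk.groupby("Taxon")["Tissue"].agg(["sum", "count"])
    for taxon, row in grouped.iterrows():
        totals[taxon] = totals.get(taxon, 0) + row["sum"]
        counts[taxon] = counts.get(taxon, 0) + row["count"]

# Combine the running totals into per-taxon means
mean_by_taxon = {taxon: totals[taxon] / counts[taxon] for taxon in totals}
print(mean_by_taxon)
```

Only one chunk is held in memory at a time, and the result does not depend on the chunk size chosen.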

In [21]:
l = pd.read_csv("data/microbiome.csv")
l.head()

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605


In [22]:
new_row = l.iloc[0].copy()
l.loc[len(l)] = new_row
l

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
...,...,...,...,...
71,Other,12,392,6
72,Other,13,28,25
73,Other,14,12,22
74,Other,15,305,32


In [24]:
# Assigning a scalar fills every column of the new row with that value
l.loc[len(l)] = 65
l

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
...,...,...,...,...
73,Other,14,12,22
74,Other,15,305,32
75,Firmicutes,1,632,305
76,65,65,65,65


In [25]:
# By default, drop_duplicates keeps the first occurrence of each duplicated row and removes the later ones.
l.drop_duplicates(inplace = True)
l

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
...,...,...,...,...
71,Other,12,392,6
72,Other,13,28,25
73,Other,14,12,22
74,Other,15,305,32


In [26]:
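new_row = l.iloc[1].copy()
# DataFrame.append was removed in pandas 2.0 and _append is private; pd.concat is the public equivalent
l = pd.concat([l, new_row.to_frame().T], ignore_index=True)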
l

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
...,...,...,...,...
72,Other,13,28,25
73,Other,14,12,22
74,Other,15,305,32
75,65,65,65,65


In [60]:
#pd.__version__

In [27]:
new_row = l.iloc[1].copy()
# pd.concat is the public replacement for the private _append method
l1 = pd.concat([l, new_row.to_frame().T])
l1

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
...,...,...,...,...
73,Other,14,12,22
74,Other,15,305,32
75,65,65,65,65
76,Firmicutes,2,136,4182


In [28]:
# Without inplace=True or reassignment, drop_duplicates leaves l unchanged
l.drop_duplicates()
l

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
...,...,...,...,...
72,Other,13,28,25
73,Other,14,12,22
74,Other,15,305,32
75,65,65,65,65


In [29]:
l[l.duplicated()!=True]

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
...,...,...,...,...
71,Other,12,392,6
72,Other,13,28,25
73,Other,14,12,22
74,Other,15,305,32


In [30]:
l[~l.duplicated()]

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
...,...,...,...,...
71,Other,12,392,6
72,Other,13,28,25
73,Other,14,12,22
74,Other,15,305,32


In [31]:
l[l.duplicated()]

Unnamed: 0,Taxon,Patient,Tissue,Stool
76,Firmicutes,2,136,4182
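
The `duplicated` and `drop_duplicates` methods both accept `keep` and `subset` arguments that control which copies are flagged and which columns are compared. The sketch below shows them on a small illustrative frame (the data are made up for the example).

```python
import pandas as pd

# Illustrative frame: row 3 repeats row 0 exactly
df = pd.DataFrame({
    "Taxon": ["Firmicutes", "Firmicutes", "Other", "Firmicutes"],
    "Patient": [1, 2, 1, 1],
    "Tissue": [632, 136, 305, 632],
})

# Default keep="first": only the later copy (row 3) is flagged
later_copies = df.duplicated()

# keep=False flags every member of a duplicated group (rows 0 and 3)
all_copies = df.duplicated(keep=False)

# Compare on Patient only and keep the last row seen for each patient
one_per_patient = df.drop_duplicates(subset=["Patient"], keep="last")

print(later_copies.tolist())
print(all_copies.tolist())
print(one_per_patient.index.tolist())
```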


In [32]:
l[l.notnull()]

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605
...,...,...,...,...
72,Other,13,28,25
73,Other,14,12,22
74,Other,15,305,32
75,65,65,65,65


Most real-world data are incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA` and `NULL`, and convert them to `NaN` (Not a Number) values when the data are loaded into a DataFrame.

In [33]:
!head -10 data/microbiome_missing.csv

Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,136,4182
Firmicutes,3,,703
Firmicutes,4,408,3946
Firmicutes,5,831,8605
Firmicutes,6,693,50
Firmicutes,7,718,717
Firmicutes,8,173,33
Firmicutes,9,228,NA


In [34]:
df_m = pd.read_csv("data/microbiome_missing.csv").head(20)
df_m

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305.0
1,Firmicutes,2,136,4182.0
2,Firmicutes,3,,703.0
3,Firmicutes,4,408,3946.0
4,Firmicutes,5,831,8605.0
5,Firmicutes,6,693,50.0
6,Firmicutes,7,718,717.0
7,Firmicutes,8,173,33.0
8,Firmicutes,9,228,
9,Firmicutes,10,162,3196.0


Above, Pandas recognized `NA` and an empty field as missing data.

In [47]:
pd.isnull(pd.read_csv("data/microbiome_missing.csv")).head(10)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,False,False,False,False
1,False,False,False,False
2,False,False,True,False
3,False,False,False,False
4,False,False,False,False
5,False,False,False,False
6,False,False,False,False
7,False,False,False,False
8,False,False,False,True
9,False,False,False,False


Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the `na_values` argument:
   

In [35]:
df_m = pd.read_csv("data/microbiome_missing.csv", na_values=['?', -99999])
df_m.head(20)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632.0,305.0
1,Firmicutes,2,136.0,4182.0
2,Firmicutes,3,,703.0
3,Firmicutes,4,408.0,3946.0
4,Firmicutes,5,831.0,8605.0
5,Firmicutes,6,693.0,50.0
6,Firmicutes,7,718.0,717.0
7,Firmicutes,8,173.0,33.0
8,Firmicutes,9,228.0,
9,Firmicutes,10,162.0,3196.0
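
A list passed to `na_values` applies to every column. If a sentinel is only meaningful in some columns, `na_values` also accepts a dictionary keyed by column name. The following is a sketch on an illustrative in-memory sample, not the contents of `microbiome_missing.csv`.

```python
import io
import pandas as pd

# Illustrative sample with mixed missing-data conventions
csv_text = """Taxon,Patient,Tissue,Stool
Firmicutes,1,632,305
Firmicutes,2,?,4182
Firmicutes,3,1174,-99999
Firmicutes,4,-99999,NA
"""

# Per-column sentinels: "?" is missing only in Tissue, -99999 only in Stool.
# The default indicators (such as "NA") still apply because keep_default_na
# is left at its default of True.
per_column = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"Tissue": ["?"], "Stool": [-99999]},
)
print(per_column.isnull().sum())
```

Here the `-99999` in the `Tissue` column is kept as a number, because that sentinel was declared for `Stool` only.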


In [36]:
df_m[df_m.duplicated()]

Unnamed: 0,Taxon,Patient,Tissue,Stool


In [37]:
df_m.notnull()

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,True,True,True,True
1,True,True,True,True
2,True,True,False,True
3,True,True,True,True
4,True,True,True,True
...,...,...,...,...
70,True,True,True,True
71,True,True,True,True
72,True,True,True,True
73,True,True,True,True


In [39]:
# Indexing with the boolean frame df_m.notnull() keeps the DataFrame's shape and puts NaN wherever the mask is False, so the result is identical to df_m.
df_m[df_m.notnull()].head(20)

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632.0,305.0
1,Firmicutes,2,136.0,4182.0
2,Firmicutes,3,,703.0
3,Firmicutes,4,408.0,3946.0
4,Firmicutes,5,831.0,8605.0
5,Firmicutes,6,693.0,50.0
6,Firmicutes,7,718.0,717.0
7,Firmicutes,8,173.0,33.0
8,Firmicutes,9,228.0,
9,Firmicutes,10,162.0,3196.0


In [40]:
result = df_m.dropna().reset_index()
result

Unnamed: 0,index,Taxon,Patient,Tissue,Stool
0,0,Firmicutes,1,632.0,305.0
1,1,Firmicutes,2,136.0,4182.0
2,3,Firmicutes,4,408.0,3946.0
3,4,Firmicutes,5,831.0,8605.0
4,5,Firmicutes,6,693.0,50.0
...,...,...,...,...,...
66,70,Other,11,203.0,6.0
67,71,Other,12,392.0,6.0
68,72,Other,13,28.0,25.0
69,73,Other,14,12.0,22.0
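
Calling `dropna()` with no arguments removes every row that has at least one missing value, which can discard more data than intended. The sketch below shows a few gentler variants on a small illustrative frame; the data and variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Illustrative frame with missing measurements
df = pd.DataFrame({
    "Taxon": ["Firmicutes", "Firmicutes", "Other", "Other"],
    "Tissue": [632.0, np.nan, 203.0, np.nan],
    "Stool": [305.0, 703.0, np.nan, np.nan],
})

# Default: drop rows with at least one NaN
any_missing = df.dropna()

# Drop a row only when both measurements are missing
all_measurements_missing = df.dropna(how="all", subset=["Tissue", "Stool"])

# Require only Tissue to be present
tissue_required = df.dropna(subset=["Tissue"])

# Impute instead of dropping: fill Tissue with its median
filled = df.fillna({"Tissue": df["Tissue"].median()})

print(any_missing.index.tolist())
print(all_measurements_missing.index.tolist())
print(tissue_required.index.tolist())
```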


### Microsoft Excel

Since so much financial and scientific data ends up in Excel spreadsheets (regrettably), Pandas' ability to directly import Excel spreadsheets is valuable. This support is contingent on having one or two dependencies (depending on what version of Excel file is being imported) installed: `xlrd` for legacy `.xls` files and `openpyxl` for `.xlsx` files (both may be installed with `pip`).

Importing Excel data to Pandas is a two-step process. First, we create an `ExcelFile` object using the path of the file:                                             

In [50]:
#!pip install xlrd

In [41]:
help(pd.read_excel)

Help on function read_excel in module pandas.io.excel._base:

read_excel(io, sheet_name: 'str | int | list[IntStrT] | None' = 0, *, header: 'int | Sequence[int] | None' = 0, names: 'list[str] | None' = None, index_col: 'int | Sequence[int] | None' = None, usecols: 'int | str | Sequence[int] | Sequence[str] | Callable[[str], bool] | None' = None, dtype: 'DtypeArg | None' = None, engine: "Literal['xlrd', 'openpyxl', 'odf', 'pyxlsb'] | None" = None, converters: 'dict[str, Callable] | dict[int, Callable] | None' = None, true_values: 'Iterable[Hashable] | None' = None, false_values: 'Iterable[Hashable] | None' = None, skiprows: 'Sequence[int] | int | Callable[[int], object] | None' = None, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, parse_dates: 'list | dict | bool' = False, date_parser: 'Callable | lib.NoDefault' = <no_default>, date_format: 'dict[Hashable, str] | str | None' = None, thousands: 'str | None' 

In [42]:
import pandas as pd
mb_file = pd.read_excel('data/microbiome/MID1.xls', sheet_name='Sheet 1')
mb_file

Unnamed: 0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",7
0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
1,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",3
2,"Archaea ""Crenarchaeota"" Thermoprotei Thermopro...",3
3,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",7
4,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",1
...,...,...
266,"Bacteria ""Thermotogae"" Thermotogae Thermotogal...",9
267,"Bacteria ""Verrucomicrobia"" Opitutae Opitutales...",1
268,Bacteria Cyanobacteria Cyanobacteria Chloropl...,2
269,Bacteria Cyanobacteria Cyanobacteria Chloropl...,85


In [43]:
mb_file.shape

(271, 2)

Then, since modern spreadsheets consist of one or more "sheets", we parse the sheet with the data of interest:
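
The two-step route described above might look like the following sketch. `parse_sheet` is a hypothetical helper name, not part of pandas; `pd.ExcelFile`, its `parse` method and its `sheet_names` attribute are pandas APIs. Calling the helper requires the engine for the file type (for example `xlrd` for `.xls`) to be installed.

```python
import pandas as pd

def parse_sheet(path, sheet_name, **parse_kwargs):
    """Open a workbook, then parse one of its sheets into a DataFrame."""
    # Step 1: open the workbook; the context manager closes the file afterwards.
    with pd.ExcelFile(path) as workbook:
        # Step 2: parse the requested sheet (available names are listed in
        # workbook.sheet_names).
        return workbook.parse(sheet_name, **parse_kwargs)

# Example call, assuming the tutorial's file and sheet name:
# mb1 = parse_sheet('data/microbiome/MID1.xls', 'Sheet 1', header=None)
```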

There is now a `read_excel` convenience function in Pandas that combines these steps into a single call:

In [44]:
mb2 = pd.read_excel('data/microbiome/MID2.xls', sheet_name='Sheet 1', header=None)
mb2.head()

Unnamed: 0,0,1
0,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",2
1,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",14
2,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",23
3,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
4,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2


In [45]:
mb2.shape

(288, 2)

In [46]:
pd.read_excel('data/microbiome/MID2.xls', sheet_name='Sheet 1', header=None, sep = ',')

TypeError: read_excel() got an unexpected keyword argument 'sep'

In [47]:
mb2_index = pd.read_excel('data/microbiome/MID2.xls', sheet_name='Sheet 1', header=None, index_col=1)
mb2_index

Unnamed: 0_level_0,0
1,Unnamed: 1_level_1
2,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba..."
14,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba..."
23,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro..."
1,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro..."
2,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro..."
...,...
15,"Bacteria ""Thermotogae"" Thermotogae Thermotogal..."
22,"Bacteria ""Thermotogae"" Thermotogae Thermotogal..."
1,Bacteria Cyanobacteria Cyanobacteria Chloropl...
2,Bacteria Cyanobacteria Cyanobacteria Chloropl...


In [48]:
pd.read_excel('data/microbiome/MID2.xls', sheet_name='Sheet 1', header=None, skiprows=1)

Unnamed: 0,0,1
0,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",14
1,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",23
2,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
3,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
4,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
...,...,...
282,"Bacteria ""Thermotogae"" Thermotogae Thermotogal...",15
283,"Bacteria ""Thermotogae"" Thermotogae Thermotogal...",22
284,Bacteria Cyanobacteria Cyanobacteria Chloropl...,1
285,Bacteria Cyanobacteria Cyanobacteria Chloropl...,2


In [49]:
pd.read_excel('data/microbiome/MID2.xls', sheet_name='Sheet 1', header=None, nrows=25)

Unnamed: 0,0,1
0,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",2
1,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",14
2,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",23
3,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
4,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
5,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
6,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
7,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",10
8,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",11
9,"Archaea ""Crenarchaeota"" Thermoprotei Thermopro...",9


In [50]:
mb_test = pd.read_excel('data/microbiome/MID2.xls', sheet_name='Sheet 1', header=None).head(25)
mb_test

Unnamed: 0,0,1
0,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",2
1,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",14
2,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",23
3,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
4,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
5,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
6,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
7,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",10
8,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",11
9,"Archaea ""Crenarchaeota"" Thermoprotei Thermopro...",9


In [51]:
mb_chunk = pd.read_excel('data/microbiome/MID2.xls', sheet_name='Sheet 1', header=None, chunksize=100)

TypeError: read_excel() got an unexpected keyword argument 'chunksize'
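
Because `read_excel` has no `chunksize` option, one possible workaround is to load the sheet once and then iterate over row slices. In the sketch below, `iter_frame_chunks` is a hypothetical helper, and a small stand-in DataFrame takes the place of the Excel sheet so that the example runs without an Excel engine. Note that this does not reduce peak memory, since the whole sheet is loaded first.

```python
import pandas as pd

def iter_frame_chunks(frame, chunk_rows):
    """Yield consecutive row slices of at most chunk_rows rows."""
    for start in range(0, len(frame), chunk_rows):
        yield frame.iloc[start:start + chunk_rows]

# Stand-in for:
# pd.read_excel('data/microbiome/MID2.xls', sheet_name='Sheet 1', header=None)
sheet = pd.DataFrame({0: [f"taxon {i}" for i in range(7)], 1: range(7)})

# Process the sheet three rows at a time; the last chunk holds the remainder
chunk_sizes = [len(chunk) for chunk in iter_frame_chunks(sheet, 3)]
print(chunk_sizes)
```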

In [52]:
new_row = mb2.iloc[0].copy()
mb2.loc[len(mb2)] = new_row
mb2

Unnamed: 0,0,1
0,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",2
1,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",14
2,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",23
3,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
4,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
...,...,...
284,"Bacteria ""Thermotogae"" Thermotogae Thermotogal...",22
285,Bacteria Cyanobacteria Cyanobacteria Chloropl...,1
286,Bacteria Cyanobacteria Cyanobacteria Chloropl...,2
287,Bacteria TM7 TM7_genera_incertae_sedis,2


In [53]:
mb2.drop_duplicates(inplace = True)
mb2

Unnamed: 0,0,1
0,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",2
1,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",14
2,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",23
3,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
4,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
...,...,...
283,"Bacteria ""Thermotogae"" Thermotogae Thermotogal...",15
284,"Bacteria ""Thermotogae"" Thermotogae Thermotogal...",22
285,Bacteria Cyanobacteria Cyanobacteria Chloropl...,1
286,Bacteria Cyanobacteria Cyanobacteria Chloropl...,2
