<a href="https://colab.research.google.com/github/carighi/al_ml_workshop/blob/main/Introduction_to_Pandas_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Acknowledgement source: The content in this tutorial is based on Practical AI/ML for Computational Biology and Chemistry Workshop (June 13-17, 2022, UD) https://github.com/udel-cbcb/al_ml_workshop

#What is Pandas?
[Pandas](https://pandas.pydata.org/) is a Python library for working intuitively with both tabular and non-tabular data. It can load data from multiple sources and supports operations such as joins and merges. Once your data is in Pandas, a number of useful tasks can be performed, such as sorting, filtering, and aggregating values; joining tables together; reshaping and resizing datasets; and handling missing values.


In [1]:
# pandas is already installed in Colab; if you need to install it elsewhere, run:
!pip install pandas



#Object creation

See the [introduction to data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro) section of Pandas documentation for details.


##Series
You can create a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series) (a one-dimensional ndarray with axis labels, including time series) as follows:

In [34]:
import pandas as pd
s = pd.Series([0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd'])
s

a    0.50
b    0.75
c    1.00
d    1.25
dtype: float64

To retrieve the value 1.0, index the Series with its label 'c':


In [32]:
import pandas as pd
s = pd.Series([0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd'])
s['c']

1.0
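
As a minimal sketch (the variable names here are illustrative, not part of the original tutorial), the same Series can be read by label, by integer position, or with a label slice. Note that a label slice includes both endpoints:

```python
import pandas as pd

s = pd.Series([0.5, 0.75, 1.0, 1.25], index=["a", "b", "c", "d"])

by_label = s["c"]          # look up by index label
by_position = s.iloc[2]    # look up by integer position (third element)
label_slice = s["a":"c"]   # label slices include the end label 'c'

print(by_label, by_position)
print(label_slice)
```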

Note: Pandas primarily uses the value np.nan to represent missing data. NaN (Not-a-Number) is a floating-point value, so it can be stored with NumPy's float64 dtype; this is why the integers in the next example are converted to floats. Missing values can be detected in any dtype. See also the Missing data section.

You can also create a Series by passing a list of values, letting pandas create a default integer index:

In [4]:
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

If you want to filter NA values, you can use notnull() as follows:

In [5]:
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s[s.notnull()]

0    1.0
1    3.0
2    5.0
4    6.0
5    8.0
dtype: float64
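
A short sketch of some equivalent ways to handle the missing entry, assuming the same Series as above (the variable names are illustrative). `notna()` is the same check as `notnull()`, and `dropna()` is a convenience method that gives the same result as the boolean mask:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

kept_by_mask = s[s.notna()]       # boolean mask, same idea as notnull()
kept_by_dropna = s.dropna()       # convenience method with the same result
n_missing = int(s.isna().sum())   # count of missing entries

print(kept_by_dropna)
print("missing:", n_missing)
```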

##DataFrame
You can create a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) (two-dimensional, size-mutable, potentially heterogeneous tabular data) by passing a list of values and naming the column explicitly:



In [6]:
# Import pandas library
import pandas as pd

# initialize list elements
data = [10,20,30,40,50,60]

# Create the pandas DataFrame, providing the column name explicitly
df = pd.DataFrame(data, columns=['Numbers'])

# Display the DataFrame
df

Unnamed: 0,Numbers
0,10
1,20
2,30
3,40
4,50
5,60


### About pd.date_range
The pd.date_range function in Pandas is used to generate a sequence of dates within a specified range. It takes several parameters, but the most commonly used ones are:

    start: The date to start the sequence.
    end: The date to end the sequence.
    periods: The number of periods to generate.
    freq: The frequency of the dates in the sequence.

For example, if you want to generate a sequence of dates from March 6, 2022 to March 11, 2022, you could use:

pd.date_range(start='3/6/2022', end='3/11/2022')

Alternatively,
pd.date_range("20220306", periods=6)

This will return a DatetimeIndex with dates from March 6, 2022 to March 11, 2022, inclusive.

In [29]:
# Alternatively, return a range of equally spaced time points from a start date and a number of periods
dates = pd.date_range("20220306", periods=6)
dates

DatetimeIndex(['2022-03-06', '2022-03-07', '2022-03-08', '2022-03-09',
               '2022-03-10', '2022-03-11'],
              dtype='datetime64[ns]', freq='D')

The 'freq' parameter specifies the frequency of the dates in the index. 'D' stands for daily frequency, meaning the dates in the index increment by one day at a time.
pd.date_range accepts a variety of string codes to specify different frequencies. Here are a few examples (pandas 2.2 and later deprecate the uppercase 'H', 'T', 'S', 'L', 'U', and 'N' in favor of 'h', 'min', 's', 'ms', 'us', and 'ns'):

    'D' for calendar day frequency
    'B' for business day frequency
    'H' (or 'h') for hourly frequency
    'T' or 'min' for minutely frequency
    'S' (or 's') for secondly frequency
    'L' or 'ms' for milliseconds
    'U' (or 'us') for microseconds
    'N' (or 'ns') for nanoseconds
    'W' for weekly frequency
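
The sketch below tries a few of these codes with `pd.date_range`. The variable names are illustrative; the codes used here ('B', 'W', and a multiple such as '2D') are the ones listed above:

```python
import pandas as pd

# Business days only: 2022-03-06 is a Sunday, so the range starts on Monday.
business_days = pd.date_range("2022-03-06", periods=5, freq="B")

# Weekly frequency: 'W' anchors on Sundays by default.
weekly = pd.date_range("2022-03-06", periods=3, freq="W")

# start/end with an explicit step of two calendar days.
every_other_day = pd.date_range(start="2022-03-06", end="2022-03-11", freq="2D")

print(business_days)
print(weekly)
print(every_other_day)
```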

Exercise: wh

In [17]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

Unnamed: 0,A,B,C,D
2022-03-06,-1.103982,-1.360475,0.245014,-1.602586
2022-03-07,1.407966,-0.169064,0.626881,-0.117075
2022-03-08,-1.220822,-0.111309,-0.287005,-0.217841
2022-03-09,1.65911,2.177612,-1.481054,1.852778
2022-03-10,0.2745,0.85037,-1.028575,-1.648368
2022-03-11,-0.284459,0.792646,1.294548,1.298642


You can also create a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

In [10]:
# The DataFrame below has columns of different types:
#   "A": float values, all set to 1.0.
#   "B": timestamp values, all set to the date 2013-01-02.
#   "C": float32 values, all set to 1; this Series supplies the index 0 to 3.
#   "D": int32 values, all set to 3.
#   "E": categorical values with two categories, "test" and "train".
#   "F": string values, all set to "foo".
# The resulting DataFrame df2 has 4 rows (index 0 to 3) and 6 columns (A to F).

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have different dtypes:

In [11]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

#About the datatypes above:

### When would you use float64 over float32?
Remember that the main difference between float32 and float64 data types lies in their precision and range of values they can represent.

float32 is a single-precision floating-point format that uses 32 bits. It has a precision of about 7 decimal digits and its range is approximately 1.18e-38 to 3.4e38.

On the other hand, float64 is a double-precision floating-point format that uses 64 bits. It has a precision of about 15 decimal digits and its range is approximately 2.23e-308 to 1.80e308.

So, if you need more precision or a larger range, you would use float64. However, float64 uses more memory and computational resources than float32, so if memory or speed is a concern and the extra precision or range is not needed, float32 might be a better choice.
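
To make the trade-off concrete, here is a small sketch using NumPy's `np.finfo` to inspect each type's limits, plus a comparison of precision and memory use. The variable names and the 1,000-element size are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

f32 = np.finfo(np.float32)
f64 = np.finfo(np.float64)
print("float32:", f32.tiny, f32.max)   # smallest normal and largest value
print("float64:", f64.tiny, f64.max)

# An increment of 1e-8 is below float32 resolution near 1.0 but not float64's.
lost_in_float32 = np.float32(1.0) + np.float32(1e-8) == np.float32(1.0)
kept_in_float64 = np.float64(1.0) + np.float64(1e-8) != np.float64(1.0)

# Memory footprint of 1,000 values in each dtype.
bytes32 = pd.Series(np.ones(1000), dtype="float32").nbytes
bytes64 = pd.Series(np.ones(1000), dtype="float64").nbytes
print(bytes32, bytes64)
```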

### Dates
'datetime64[ns]' means the dates are stored as 64-bit NumPy datetime values with nanosecond precision.



#Viewing data

See [Essential basic functionality](https://pandas.pydata.org/docs/user_guide/basics.html#basics) section of Pandas documentation for details.

You can view the top and bottom rows of the frame:

In [None]:
df.head(3) # first three rows

Unnamed: 0,A,B,C,D
2022-03-06,0.254345,-1.406071,0.428245,0.183281
2022-03-07,-0.888933,-0.378572,0.577769,1.608107
2022-03-08,0.235762,-1.73866,1.3941,-1.394932


In [12]:
df.tail(2) # last two rows

Unnamed: 0,A,B,C,D
2022-03-10,-1.118223,-1.039329,-1.824761,0.162934
2022-03-11,0.063615,0.346729,-0.024786,-0.219673


You can display the index and columns:

In [25]:
df.index

DatetimeIndex(['2022-03-06', '2022-03-07', '2022-03-08', '2022-03-09',
               '2022-03-10', '2022-03-11'],
              dtype='datetime64[ns]', freq='D')

In [15]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

**describe**() shows a quick statistical summary of your data:

In [14]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.042272,-0.148281,-0.57522,-0.137664
std,1.051699,0.94167,1.233369,0.560091
min,-1.404005,-1.520349,-1.972057,-1.199937
25%,-0.822764,-0.778851,-1.512951,-0.18161
50%,0.19123,0.174657,-0.497511,0.021032
75%,0.437249,0.35591,-0.122965,0.149571
max,1.409417,0.961708,1.365304,0.38863


Transposing your data:

In [18]:
df.T

Unnamed: 0,2022-03-06,2022-03-07,2022-03-08,2022-03-09,2022-03-10,2022-03-11
A,-1.103982,1.407966,-1.220822,1.65911,0.2745,-0.284459
B,-1.360475,-0.169064,-0.111309,2.177612,0.85037,0.792646
C,0.245014,0.626881,-0.287005,-1.481054,-1.028575,1.294548
D,-1.602586,-0.117075,-0.217841,1.852778,-1.648368,1.298642


[Sorting](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html) by axis:

In [19]:
# The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
df.sort_index(axis=1, ascending=False) # Sort based on column label

Unnamed: 0,D,C,B,A
2022-03-06,-1.602586,0.245014,-1.360475,-1.103982
2022-03-07,-0.117075,0.626881,-0.169064,1.407966
2022-03-08,-0.217841,-0.287005,-0.111309,-1.220822
2022-03-09,1.852778,-1.481054,2.177612,1.65911
2022-03-10,-1.648368,-1.028575,0.85037,0.2745
2022-03-11,1.298642,1.294548,0.792646,-0.284459


[Sorting](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values) by values:

In [20]:
df.sort_values(by="C") # Sort by 'C' column ascending

Unnamed: 0,A,B,C,D
2022-03-09,1.65911,2.177612,-1.481054,1.852778
2022-03-10,0.2745,0.85037,-1.028575,-1.648368
2022-03-08,-1.220822,-0.111309,-0.287005,-0.217841
2022-03-06,-1.103982,-1.360475,0.245014,-1.602586
2022-03-07,1.407966,-0.169064,0.626881,-0.117075
2022-03-11,-0.284459,0.792646,1.294548,1.298642


#Selection

See [Indexing and Selecting Data](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing) section of Pandas documentation for details.

Selecting a single column, which yields a Series, equivalent to df.A:

In [21]:
df["A"] # Select 'A' column

2022-03-06   -1.103982
2022-03-07    1.407966
2022-03-08   -1.220822
2022-03-09    1.659110
2022-03-10    0.274500
2022-03-11   -0.284459
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:

In [22]:
df[0:4] # Select first 4 rows

Unnamed: 0,A,B,C,D
2022-03-06,-1.103982,-1.360475,0.245014,-1.602586
2022-03-07,1.407966,-0.169064,0.626881,-0.117075
2022-03-08,-1.220822,-0.111309,-0.287005,-0.217841
2022-03-09,1.65911,2.177612,-1.481054,1.852778


In [23]:
df["20220306":"20220310"] # Get "2022-03-06" through "2022-03-10" rows

Unnamed: 0,A,B,C,D
2022-03-06,-1.103982,-1.360475,0.245014,-1.602586
2022-03-07,1.407966,-0.169064,0.626881,-0.117075
2022-03-08,-1.220822,-0.111309,-0.287005,-0.217841
2022-03-09,1.65911,2.177612,-1.481054,1.852778
2022-03-10,0.2745,0.85037,-1.028575,-1.648368


##Selection by label

**loc** selects rows and columns with specific labels. **iloc** selects rows and columns at specific integer positions.
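
One difference worth seeing side by side: a label slice with **loc** includes its end label, while a position slice with **iloc** excludes its stop position. The sketch below uses a small hypothetical frame with string labels so that labels and positions are easy to tell apart:

```python
import pandas as pd

# A small frame with string labels (hypothetical data for illustration).
demo = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1.5, 2.5, 3.5, 4.5]},
    index=["w", "x", "y", "z"],
)

by_label = demo.loc["x":"y", "A"]   # label slice: includes both 'x' and 'y'
by_position = demo.iloc[1:2, 0]     # position slice: row 1 only, stop is excluded

print(by_label)
print(by_position)
```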

Getting a cross section using a label:

In [None]:
dates

DatetimeIndex(['2022-03-06', '2022-03-07', '2022-03-08', '2022-03-09',
               '2022-03-10', '2022-03-11'],
              dtype='datetime64[ns]', freq='D')

In [None]:
df.loc[dates[0]] # Get row indexed by '2022-03-06'

A    0.254345
B   -1.406071
C    0.428245
D    0.183281
Name: 2022-03-06 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [None]:
df.loc[:, ["A", "B"]] # Get 'A' and 'B' columns for all rows

Unnamed: 0,A,B
2022-03-06,0.254345,-1.406071
2022-03-07,-0.888933,-0.378572
2022-03-08,0.235762,-1.73866
2022-03-09,0.73083,1.02758
2022-03-10,1.010398,-0.217928
2022-03-11,0.518823,0.426742


Showing label slicing, both endpoints are included:

In [None]:
df.loc["20220307":"20220309", ["A", "B"]] # Get 'A' and 'B' columns for rows indexed by '2022-03-07' through '2022-03-09'

Unnamed: 0,A,B
2022-03-07,-0.888933,-0.378572
2022-03-08,0.235762,-1.73866
2022-03-09,0.73083,1.02758


Reduction in the dimensions of the returned object:

In [None]:
df.loc["20220308", ["A", "B"]] # Get 'A' and 'B' columns of '2022-03-08' row

A    0.235762
B   -1.738660
Name: 2022-03-08 00:00:00, dtype: float64

Getting a scalar value:

In [None]:
df.loc[dates[0], "A"] # Get value at dates[0] row and 'A' column.

0.25434471232921885

Getting fast access to a scalar (equivalent to the prior method):

In [None]:
df.at[dates[0], "A"]

0.25434471232921885

##Selection by position
Selecting via the position of the passed integers:

In [None]:
df.iloc[2] # Get all values of third row

A    0.235762
B   -1.738660
C    1.394100
D   -1.394932
Name: 2022-03-08 00:00:00, dtype: float64

By integer slices, similar to NumPy/Python:

In [None]:
df

Unnamed: 0,A,B,C,D
2022-03-06,0.254345,-1.406071,0.428245,0.183281
2022-03-07,-0.888933,-0.378572,0.577769,1.608107
2022-03-08,0.235762,-1.73866,1.3941,-1.394932
2022-03-09,0.73083,1.02758,0.114515,-0.582374
2022-03-10,1.010398,-0.217928,-0.401978,1.412856
2022-03-11,0.518823,0.426742,0.728492,-1.556056


In [None]:
df.iloc[3:5, 0:2] # Get the 4th and 5th rows, columns 'A' and 'B'

Unnamed: 0,A,B
2022-03-09,0.73083,1.02758
2022-03-10,1.010398,-0.217928


By lists of integer position locations, similar to the NumPy/Python style:

In [None]:
df.iloc[[1, 2, 4], [0, 2]] # Get the 2nd, 3rd, and 5th rows, columns 'A' and 'C'

Unnamed: 0,A,C
2022-03-07,-0.888933,0.577769
2022-03-08,0.235762,1.3941
2022-03-10,1.010398,-0.401978


Slicing rows explicitly:

In [None]:
df.iloc[1:3, :] # Get the 2nd and 3rd rows, all columns

Unnamed: 0,A,B,C,D
2022-03-07,-0.888933,-0.378572,0.577769,1.608107
2022-03-08,0.235762,-1.73866,1.3941,-1.394932


Slicing columns explicitly:

In [None]:
df.iloc[:, 1:3] # Get the 2nd and 3rd columns, all rows

Unnamed: 0,B,C
2022-03-06,-1.406071,0.428245
2022-03-07,-0.378572,0.577769
2022-03-08,-1.73866,1.3941
2022-03-09,1.02758,0.114515
2022-03-10,-0.217928,-0.401978
2022-03-11,0.426742,0.728492


 Getting a value explicitly:

In [None]:
df.iloc[1, 1] # Get the value at the 2nd row and 2nd column

-0.37857209100663536

Getting fast access to a scalar (equivalent to the prior method):

In [None]:
df.iat[1, 1]

-0.37857209100663536

##Boolean indexing
Using a single column’s values to select data:

In [None]:
df[df["A"] > 0] # Get rows where column 'A' is greater than 0

Unnamed: 0,A,B,C,D
2022-03-06,0.254345,-1.406071,0.428245,0.183281
2022-03-08,0.235762,-1.73866,1.3941,-1.394932
2022-03-09,0.73083,1.02758,0.114515,-0.582374
2022-03-10,1.010398,-0.217928,-0.401978,1.412856
2022-03-11,0.518823,0.426742,0.728492,-1.556056
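
Conditions can also be combined. The sketch below uses a small hypothetical frame with fixed values so the result is easy to follow. Each condition needs its own parentheses, and the element-wise operators `&`, `|`, and `~` are used in place of Python's `and`, `or`, and `not`:

```python
import pandas as pd

# Hypothetical data with fixed values.
demo = pd.DataFrame({"A": [-1.0, 0.5, 2.0, 3.0], "B": [4.0, -2.0, 1.0, -0.5]})

both_positive = demo[(demo["A"] > 0) & (demo["B"] > 0)]     # AND
either_negative = demo[(demo["A"] < 0) | (demo["B"] < 0)]   # OR
not_positive_a = demo[~(demo["A"] > 0)]                     # NOT

print(both_positive)
print(either_negative)
print(not_positive_a)
```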


Using the isin() method for filtering:

In [None]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"] # Add new column 'E'
df2

Unnamed: 0,A,B,C,D,E
2022-03-06,0.254345,-1.406071,0.428245,0.183281,one
2022-03-07,-0.888933,-0.378572,0.577769,1.608107,one
2022-03-08,0.235762,-1.73866,1.3941,-1.394932,two
2022-03-09,0.73083,1.02758,0.114515,-0.582374,three
2022-03-10,1.010398,-0.217928,-0.401978,1.412856,four
2022-03-11,0.518823,0.426742,0.728492,-1.556056,three


In [None]:
df2[df2["E"].isin(["two", "four"])] # Get rows where the value in column 'E' is either "two" or "four"

Unnamed: 0,A,B,C,D,E
2022-03-08,0.235762,-1.73866,1.3941,-1.394932,two
2022-03-10,1.010398,-0.217928,-0.401978,1.412856,four


##Setting
Setting a new column automatically aligns the data by the indexes:

In [None]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20220306", periods=6))
s1

2022-03-06    1
2022-03-07    2
2022-03-08    3
2022-03-09    4
2022-03-10    5
2022-03-11    6
Freq: D, dtype: int64

In [None]:
df["F"] = s1
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.254345,-1.406071,0.428245,0.183281,1
2022-03-07,-0.888933,-0.378572,0.577769,1.608107,2
2022-03-08,0.235762,-1.73866,1.3941,-1.394932,3
2022-03-09,0.73083,1.02758,0.114515,-0.582374,4
2022-03-10,1.010398,-0.217928,-0.401978,1.412856,5
2022-03-11,0.518823,0.426742,0.728492,-1.556056,6
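
Alignment is easier to see when the Series does not cover every row. In this sketch (the frame, the Series, and the column name "G" are hypothetical), the Series holds only two of the four dates, listed out of order; pandas matches values by index label and fills the remaining rows with NaN:

```python
import pandas as pd

dates = pd.date_range("20220306", periods=4)
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=dates)

# The Series covers only two of the four dates, listed out of order.
partial = pd.Series([30, 10], index=[dates[2], dates[0]])

frame["G"] = partial   # values are matched by index label, not by position
print(frame)
```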


Setting values by label:

In [None]:
df.at[dates[0], "A"] = 0 # Set the value at dates[0] and 'A' column to be 0
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,-1.406071,0.428245,0.183281,1
2022-03-07,-0.888933,-0.378572,0.577769,1.608107,2
2022-03-08,0.235762,-1.73866,1.3941,-1.394932,3
2022-03-09,0.73083,1.02758,0.114515,-0.582374,4
2022-03-10,1.010398,-0.217928,-0.401978,1.412856,5
2022-03-11,0.518823,0.426742,0.728492,-1.556056,6


Setting values by position:

In [None]:
df.iat[0, 1] = 0 # Set the value at first row, second column to be 0
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,0.428245,0.183281,1
2022-03-07,-0.888933,-0.378572,0.577769,1.608107,2
2022-03-08,0.235762,-1.73866,1.3941,-1.394932,3
2022-03-09,0.73083,1.02758,0.114515,-0.582374,4
2022-03-10,1.010398,-0.217928,-0.401978,1.412856,5
2022-03-11,0.518823,0.426742,0.728492,-1.556056,6


Setting by assigning with a NumPy array:

In [None]:
df.loc[:, "D"] = np.array([5] * len(df)) # Set every value in column 'D' to 5
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,0.428245,5,1
2022-03-07,-0.888933,-0.378572,0.577769,5,2
2022-03-08,0.235762,-1.73866,1.3941,5,3
2022-03-09,0.73083,1.02758,0.114515,5,4
2022-03-10,1.010398,-0.217928,-0.401978,5,5
2022-03-11,0.518823,0.426742,0.728492,5,6


#Missing data

See [Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data) section of Pandas documentation for details.

Pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations.





In [None]:
# Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
df1

Unnamed: 0,A,B,C,D,F,E
2022-03-06,0.0,0.0,0.428245,5,1,1.0
2022-03-07,-0.888933,-0.378572,0.577769,5,2,1.0
2022-03-08,0.235762,-1.73866,1.3941,5,3,
2022-03-09,0.73083,1.02758,0.114515,5,4,


To drop any rows that have missing data:

In [None]:
df1.dropna(how="any")

Unnamed: 0,A,B,C,D,F,E
2022-03-06,0.0,0.0,0.428245,5,1,1.0
2022-03-07,-0.888933,-0.378572,0.577769,5,2,1.0


Filling missing data:

In [None]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2022-03-06,0.0,0.0,0.428245,5,1,1.0
2022-03-07,-0.888933,-0.378572,0.577769,5,2,1.0
2022-03-08,0.235762,-1.73866,1.3941,5,3,5.0
2022-03-09,0.73083,1.02758,0.114515,5,4,5.0


To get the boolean mask where values are *NaN*:

In [None]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,F,E
2022-03-06,False,False,False,False,False,False
2022-03-07,False,False,False,False,False,False
2022-03-08,False,False,False,False,False,True
2022-03-09,False,False,False,False,False,True
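
Building on the mask, a common follow-up is to count missing values per column or to fill each column with its own value. This is a sketch on a small hypothetical frame; the fill values are arbitrary:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
demo = pd.DataFrame({"A": [1.0, np.nan, 3.0], "E": [np.nan, np.nan, 1.0]})

missing_per_column = demo.isna().sum()        # number of NaN values in each column
filled = demo.fillna({"A": 0.0, "E": -1.0})   # a different fill value per column

print(missing_per_column)
print(filled)
```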


#Operations

See the [Flexible binary operations](https://pandas.pydata.org/docs/user_guide/basics.html#basics-binop) section of Pandas documentation for details.

##Stats
Operations in general exclude missing data.
Performing a descriptive statistic:

In [None]:
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,0.428245,5,1
2022-03-07,-0.888933,-0.378572,0.577769,5,2
2022-03-08,0.235762,-1.73866,1.3941,5,3
2022-03-09,0.73083,1.02758,0.114515,5,4
2022-03-10,1.010398,-0.217928,-0.401978,5,5
2022-03-11,0.518823,0.426742,0.728492,5,6


In [None]:
df.max(axis=0) # Get max of all columns

A    1.010398
B    1.027580
C    1.394100
D    5.000000
F    6.000000
dtype: float64

Same operation on the other axis:

In [None]:
df.max(axis=1) # Get max of all rows

2022-03-06    5.0
2022-03-07    5.0
2022-03-08    5.0
2022-03-09    5.0
2022-03-10    5.0
2022-03-11    6.0
Freq: D, dtype: float64
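
The statement that operations exclude missing data can be checked with a short sketch (the Series is hypothetical). By default NaN is skipped; passing `skipna=False` makes it propagate into the result instead:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

mean_default = values.mean()              # NaN is skipped: (1 + 3) / 2
mean_strict = values.mean(skipna=False)   # NaN propagates into the result

print(mean_default, mean_strict)
```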

##Apply

Applying functions to the data:

In [None]:
df.apply(np.cumsum) # Cumulative sum down each column (apply works column by column by default)

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,0.428245,5,1
2022-03-07,-0.888933,-0.378572,1.006014,10,3
2022-03-08,-0.65317,-2.117232,2.400114,15,6
2022-03-09,0.07766,-1.089652,2.514629,20,10
2022-03-10,1.088057,-1.30758,2.112651,25,15
2022-03-11,1.60688,-0.880838,2.841143,30,21


In [None]:
df.apply(lambda x: x.max() - x.min()) # Get the max-min difference of each column

A    1.899331
B    2.766240
C    1.796079
D    0.000000
F    5.000000
dtype: float64
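
By default `apply` passes each column to the function. As a sketch on a small hypothetical frame, passing `axis=1` applies the same function to each row instead:

```python
import pandas as pd

# Hypothetical data with fixed values.
demo = pd.DataFrame({"A": [1.0, 4.0, 2.0], "B": [3.0, 1.0, 8.0]})

col_range = demo.apply(lambda col: col.max() - col.min())           # one value per column
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)   # one value per row

print(col_range)
print(row_range)
```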

##String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).

In [None]:
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
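
As an example of the regular-expression behavior mentioned above, this sketch filters the same Series with `str.contains`. The pattern and variable names are illustrative; `na=False` is used so that the missing entry counts as a non-match rather than producing NaN in the mask:

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

# Regular expression: strings that start with 'a' or 'b', ignoring case.
# na=False treats the missing entry as a non-match instead of returning NaN.
starts_with_a_or_b = s.str.contains(r"^[ab]", case=False, na=False)
matches = s[starts_with_a_or_b]

print(matches)
```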

#Getting data in/out
##CSV
Writing to a csv file:

In [None]:
df.to_csv("foo.csv")

Reading from a csv file:

In [None]:
pd.read_csv("foo.csv")

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2000-01-01,0.492381,-0.201196,0.368564,-0.054544
1,2000-01-02,2.699226,-0.907256,-0.112431,-0.539807
2,2000-01-03,1.696482,-1.546123,0.040211,-0.767437
3,2000-01-04,2.940826,-2.272474,0.547894,-1.399211
4,2000-01-05,3.091651,-3.797727,1.070643,-1.661960
...,...,...,...,...,...
995,2002-09-22,41.891795,-15.864832,-46.506908,-36.090267
996,2002-09-23,41.209228,-17.114287,-45.459203,-36.088940
997,2002-09-24,40.259763,-18.350173,-44.968952,-37.309482
998,2002-09-25,39.430521,-15.162521,-44.970068,-36.629036
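
The `Unnamed: 0` column appears because `to_csv` writes the index as an unnamed first column, and `read_csv` then reads it back as ordinary data. A minimal sketch of one way to restore the index, using a small hypothetical frame and a temporary file so nothing is left on disk:

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame(
    {"A": [1.5, 2.5, 3.5]},
    index=pd.date_range("20220306", periods=3),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")  # hypothetical file name
    original.to_csv(path)
    # index_col=0 reuses the first CSV column as the index;
    # parse_dates=True asks pandas to parse that index as dates.
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(restored)
```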


##HDF5

HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections. If you want to know more about HDF5 format, please see [What is HDF5](https://support.hdfgroup.org/HDF5/whatishdf5.html) for details.

Reading and writing to HDFStores (this requires the optional PyTables package, installed with `pip install tables`).



In [None]:
df.to_hdf("foo.h5", key="df")

Reading from an HDF5 store:

In [None]:
pd.read_hdf("foo.h5", "df")

Unnamed: 0,A,B,C,D
2000-01-01,0.492381,-0.201196,0.368564,-0.054544
2000-01-02,2.699226,-0.907256,-0.112431,-0.539807
2000-01-03,1.696482,-1.546123,0.040211,-0.767437
2000-01-04,2.940826,-2.272474,0.547894,-1.399211
2000-01-05,3.091651,-3.797727,1.070643,-1.661960
...,...,...,...,...
2002-09-22,41.891795,-15.864832,-46.506908,-36.090267
2002-09-23,41.209228,-17.114287,-45.459203,-36.088940
2002-09-24,40.259763,-18.350173,-44.968952,-37.309482
2002-09-25,39.430521,-15.162521,-44.970068,-36.629036


##Excel

Reading and writing to MS Excel.

Writing to an Excel file:

In [None]:
df.to_excel("foo.xlsx", sheet_name="Sheet1")

Reading from an Excel file:

In [None]:
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2000-01-01,0.492381,-0.201196,0.368564,-0.054544
1,2000-01-02,2.699226,-0.907256,-0.112431,-0.539807
2,2000-01-03,1.696482,-1.546123,0.040211,-0.767437
3,2000-01-04,2.940826,-2.272474,0.547894,-1.399211
4,2000-01-05,3.091651,-3.797727,1.070643,-1.661960
...,...,...,...,...,...
995,2002-09-22,41.891795,-15.864832,-46.506908,-36.090267
996,2002-09-23,41.209228,-17.114287,-45.459203,-36.088940
997,2002-09-24,40.259763,-18.350173,-44.968952,-37.309482
998,2002-09-25,39.430521,-15.162521,-44.970068,-36.629036
