# Using pandas Series

A Series is a one-dimensional object that holds an array of any NumPy data type. The key difference from a plain NumPy array is that a Series has an associated array of data labels, called the **index**.

If we do not specify an index, a default integer index (0, 1, 2, ...) is used, which matches the positions of a NumPy array.

In [2]:
import numpy as np
import pandas as pd

# Create the array
array = np.random.randint(1, 100, 10)
# print(array)
# print(type(array))

# Create a Series from the array
series = pd.Series(array)
print(series)
print(type(series))
print(type(series.values))
print(type(series.index))
# Observe that a Series prints differently from an array: each value appears next to its label.

0    53
1    35
2    69
3     3
4    73
5    63
6    98
7    85
8    60
9    74
dtype: int32
<class 'pandas.core.series.Series'>
<type 'numpy.ndarray'>
<class 'pandas.indexes.range.RangeIndex'>


## Indexes in series

The index is a separate array, and it can be changed after the Series has been created. You can also provide it when you create the Series.

In [3]:
print(series.index)
print(type(series.index))

# Change the index: assign a flat list with one label per value
series.index = list("abcdefghij")
print(series)
print(type(series.index))

RangeIndex(start=0, stop=10, step=1)
<class 'pandas.indexes.range.RangeIndex'>
a    53
b    35
c    69
d     3
e    73
f    63
g    98
h    85
i    60
j    74
dtype: int32
<class 'pandas.indexes.base.Index'>
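As a minimal sketch of the other option mentioned above, the index can be passed directly when the Series is created. The variable name, values and labels below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical values and labels, used only to illustrate the index argument
temperatures = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"])

print(temperatures)
print(temperatures.index.tolist())  # ['mon', 'tue', 'wed']
print(temperatures["tue"])          # 19.0
```

The number of labels must match the number of values, otherwise pandas raises a `ValueError`.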


**Reflection:** a Series can be thought of as an **ordered** dictionary with fixed keys. In fact, it behaves like a Python dict in some situations, and you can create a Series from a Python dict.

In [4]:
print("a" in series)
print("x" in series)
print(series["c"])

True
False
69
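The dict analogy can be sketched in the other direction too: building a Series from a Python dict. The city names and figures below are made up for the example; the dict keys become the index labels.

```python
import pandas as pd

# Hypothetical data: the keys become the index, the values become the data
population = {"Madrid": 3.3, "Paris": 2.1, "Rome": 2.8}
cities = pd.Series(population)

print(cities)
print("Paris" in cities)               # True: membership tests the index labels
print(2.1 in cities)                   # False: values are not searched
print(list(cities.keys()))             # ['Madrid', 'Paris', 'Rome']
print(cities.to_dict() == population)  # True: the round trip preserves the data
```

Note that `in` checks the labels, not the values, which is the same behaviour as a dict.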


You can select several elements at once by passing a list of index labels.

In [5]:
series[["a","c","f"]]

a    53
c    69
f    63
dtype: int32
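Label-based and position-based selection can be made explicit with `.loc` and `.iloc`. The sketch below uses a small hypothetical Series (`letters`) rather than the random one above, so the results are predictable.

```python
import pandas as pd

# Hypothetical Series with string labels
letters = pd.Series([53, 35, 69, 3], index=list("abcd"))

by_label = letters.loc[["a", "c"]]     # select by index label
by_position = letters.iloc[[0, 2]]     # select by integer position
print(by_label.equals(by_position))    # True

# A label slice includes both ends; a position slice excludes the end
print(letters.loc["a":"c"].tolist())   # [53, 35, 69]
print(letters.iloc[0:2].tolist())      # [53, 35]
```

Using `.loc` and `.iloc` avoids ambiguity when the index itself contains integers.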

## Series use indexes in a smart way

Let's create two series with different indexes.

In [6]:
import pandas as pd
import numpy as np
sales1Q = pd.Series(np.random.randint(20, 100, 4),
                    index=["Europe", "USA", "LATAM", "Asia"])
sales1Q.index.name = "regions"
print(sales1Q)

print("--------")

sales2Q = pd.Series(np.random.randint(20, 100, 3),
                    index=["Europe", "USA", "Asia"])
print(sales2Q)

regions
Europe    89
USA       44
LATAM     32
Asia      35
dtype: int32
--------
Europe    48
USA       99
Asia      40
dtype: int32


Now let's try to combine them.

In [7]:
sales1Q + sales2Q

Asia       75.0
Europe    137.0
LATAM       NaN
USA       143.0
dtype: float64

The addition was aligned on the index labels, not on positions.
The problem is that "LATAM" was not in the second index, so it was treated as missing there and the result is NaN (which is also why the result is `float64`). We must take care of missing values, either replacing them or removing the corresponding data, depending on our data.

In [8]:
sales1Q.add(sales2Q, fill_value=0)

Asia       75.0
Europe    137.0
LATAM      32.0
USA       143.0
dtype: float64
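The two options mentioned above, removing or replacing missing values, can be sketched with `dropna` and `fillna`. The fixed numbers below are copied from the sample output shown earlier so the result is reproducible; the variable names `q1` and `q2` are hypothetical.

```python
import pandas as pd

# Values taken from the sample output above (the original cells use random data)
q1 = pd.Series([89, 44, 32, 35], index=["Europe", "USA", "LATAM", "Asia"])
q2 = pd.Series([48, 99, 40], index=["Europe", "USA", "Asia"])

total = q1 + q2
print(total.isna())        # True only for LATAM

# Option 1: remove the labels that have no complete data
dropped = total.dropna()
print(dropped)

# Option 2: replace the missing totals with the first-quarter values,
# matched by label
filled = total.fillna(q1)
print(filled)
```

Here `fillna(q1)` gives the same result as `q1.add(q2, fill_value=0)`, because the only missing label exists in `q1`. Which option is appropriate depends on what a missing value means in the data.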

## Manipulating Series

You can use dates to index a Series.

In [9]:
dates = pd.date_range('20130125', periods=10)

values = pd.Series(np.random.randint(1, 10, 10), index=dates)
print(values)
values.tail(4)

2013-01-25    6
2013-01-26    3
2013-01-27    7
2013-01-28    7
2013-01-29    2
2013-01-30    4
2013-01-31    9
2013-02-01    1
2013-02-02    9
2013-02-03    5
Freq: D, dtype: int32


2013-01-31    9
2013-02-01    1
2013-02-02    9
2013-02-03    5
Freq: D, dtype: int32

In [10]:

# We can select by comparing the index against a date
from datetime import datetime
end_january = datetime(2013, 1, 31)
february = values[values.index > end_january]
print(february)

print("-------")

# And also by comparing the values
higher_than_five = values[values > 5]
print(higher_than_five)

2013-02-01    1
2013-02-02    9
2013-02-03    5
Freq: D, dtype: int32
-------
2013-01-25    6
2013-01-27    7
2013-01-28    7
2013-01-31    9
2013-02-02    9
dtype: int32
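With a date index, selection can also be written using date strings. The sketch below rebuilds the Series with the values from the sample output above (hypothetical name `daily`) and relies on the partial string indexing that pandas offers for a `DatetimeIndex` through `.loc`.

```python
import pandas as pd

# Same dates as above, with the values copied from the sample output
dates = pd.date_range("20130125", periods=10)
daily = pd.Series([6, 3, 7, 7, 2, 4, 9, 1, 9, 5], index=dates)

# All entries in February 2013, selected with a partial date string
feb = daily.loc["2013-02"]
print(feb.tolist())               # [1, 9, 5]

# A date slice includes both end points
window = daily.loc["2013-01-28":"2013-01-31"]
print(window.tolist())            # [7, 2, 4, 9]

# Conditions on the index and on the values can be combined with &
high_in_january = daily[(daily > 5) & (daily.index.month == 1)]
print(high_in_january.tolist())   # [6, 7, 7, 9]
```

Each condition needs its own parentheses, because `&` binds more tightly than the comparison operators.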


## Series of strings can also be manipulated

In [11]:
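names = pd.Series(["John Smith", "Bob Geldof", "Ron Paul", "Frank Miller"])
# Split on whitespace; expand=True returns a DataFrame with one column per part
names = names.str.split(expand=True)
# "nombre" = first name, "apellido" = surname
names.columns = ["nombre", "apellido"]
names.sort_values(by=["apellido"])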

  nombre apellido
1    Bob   Geldof
3  Frank   Miller
2    Ron     Paul
0   John    Smith
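Other methods are available through the same `.str` accessor. The sketch below shows a few of them on the same list of names; the variable name `full_names` is hypothetical.

```python
import pandas as pd

full_names = pd.Series(["John Smith", "Bob Geldof", "Ron Paul", "Frank Miller"])

# Element-wise string operations, applied to every entry at once
print(full_names.str.upper().tolist())
print(full_names.str.len().tolist())      # [10, 10, 8, 12]

# Take the last word of each name
surnames = full_names.str.split().str[-1]
print(surnames.tolist())                  # ['Smith', 'Geldof', 'Paul', 'Miller']

# Boolean mask built from a string test, then used to filter
with_o = full_names[full_names.str.contains("o")]
print(with_o.tolist())                    # ['John Smith', 'Bob Geldof', 'Ron Paul']
```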


## Series do stats too

Series also provide a number of statistical functions, such as correlation and covariance.

In [12]:
# Two independent samples: correlation and covariance should be close to 0
s1 = pd.Series(np.random.randn(1000))

s2 = pd.Series(np.random.randn(1000))

print(s1.corr(s2))
print(s1.cov(s2))


0.0260535385974
0.0267996798631


In [13]:
s1 = pd.Series(np.random.randn(1000))

# s2 is s1 plus a small amount of noise, so the two are strongly correlated
s2 = pd.Series(s1 + np.random.randn(1000)/10)

print(s1.corr(s2))
print(s1.cov(s2))

0.995012118582
0.983855244307
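The relation between the two numbers can be checked by hand: the Pearson correlation is the covariance divided by the product of the standard deviations. The sketch below uses a fixed seed and the hypothetical names `a` and `b` so that it is reproducible; it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption for reproducibility; the cells above are unseeded)
rng = np.random.RandomState(0)
a = pd.Series(rng.randn(1000))
b = pd.Series(a + rng.randn(1000) / 10)

# corr(a, b) = cov(a, b) / (std(a) * std(b))
manual_corr = a.cov(b) / (a.std() * b.std())
print(manual_corr)
print(np.isclose(manual_corr, a.corr(b)))  # True

# describe() gives a quick summary: count, mean, std, min, quartiles, max
print(a.describe())
```

Because the correlation is normalised by the standard deviations, it always lies between -1 and 1, while the covariance depends on the scale of the data.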


More on statistical functions: http://pandas.pydata.org/pandas-docs/dev/computation.html

<code>f_oneway</code> tests the null hypothesis that two or more groups have the same population mean. The null hypothesis is rejected if the resulting p-value is less than or equal to a small, fixed but arbitrarily pre-defined threshold value $\alpha$, which is referred to as the level of significance.

However, it has some assumptions: http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.stats.f_oneway.html   Do they hold in our case?

The cell below uses a related test, <code>ttest_1samp</code>, which tests the null hypothesis that the population mean of a single sample equals a given value (here, 0).

In [14]:
from scipy.stats import ttest_1samp
t, p = ttest_1samp(s1, 0)
# A large p-value means we cannot reject the null hypothesis that the population mean is 0.
print(p)

0.856138531387
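To see what the test measures, the t statistic of a one-sample t-test can be computed by hand as the distance between the sample mean and the hypothesised mean, in units of the standard error. This is a minimal sketch with a fixed seed and hypothetical names (`sample`, `mu0`, `t_manual`); it computes only the statistic, not the p-value, and assumes the usual sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption for reproducibility)
rng = np.random.RandomState(0)
sample = pd.Series(rng.randn(1000))

mu0 = 0.0                                   # hypothesised population mean
n = sample.count()
standard_error = sample.std() / np.sqrt(n)  # Series.std() uses ddof=1 by default
t_manual = (sample.mean() - mu0) / standard_error
print(t_manual)

# Series.sem() returns the same standard error directly
print(np.isclose(standard_error, sample.sem()))  # True
```

A value of `t_manual` close to 0 means the sample mean is close to `mu0` relative to its uncertainty, which corresponds to a large p-value.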
