# NumPy and Pandas

- Our next lecture will go over the most commonly used packages for handling data.
- Parts of this lecture were adapted from VanderPlas, Jake. *Python Data Science Handbook: Essential Tools for Working with Data*. O'Reilly Media, Inc., 2016.
- **If you have any questions over the course of this lecture, please post them to the 'Day 2 Lecture Questions' assignment on the Canvas course page.**

## What is NumPy: an alternative to lists

- NumPy's main object is the array, a multidimensional object that is analogous to the Python list.
- NumPy stands for Numerical Python and, as it sounds, is used for computation on array objects.
- There are many advantages to using NumPy over lists:
    - NumPy can read and write array data.
    - You can do quick math functions without all the for loops.
    - You can do linear algebra and create random numbers.


In [113]:
import numpy as np

an_array = np.arange(1000) # this is NumPy's version of range

#print(an_array)

In [114]:
a_list = list(range(1000))

#print(a_list)

# we can compare how quickly each object performs tasks

## Speed test: Lists vs. NumPy

In [115]:
%%time
for i in an_array: 
    an_array2 = an_array * 2

# a few milliseconds to complete (exact timings vary from run to run)

Wall time: 7.98 ms


In [116]:
%%time
for i in a_list: 
    a_list2 = [x * 2 for x in a_list]

# over 100 milliseconds to complete (exact timings vary from run to run)

Wall time: 147 ms
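
The two cells above loop over each container 1000 times and redo the whole doubling on every pass, so the wall times cover 1000 repetitions. As a rough sketch of the same comparison outside a notebook, the standard-library `timeit` module can run each statement a fixed number of times. The variable names `numpy_seconds`, `list_seconds`, and `same_values` are made up for this example, and the timings will differ on every machine.

```python
import timeit

import numpy as np

an_array = np.arange(1000)
a_list = list(range(1000))

# Time 1000 repetitions of doubling every element, once per container type.
numpy_seconds = timeit.timeit(lambda: an_array * 2, number=1000)
list_seconds = timeit.timeit(lambda: [x * 2 for x in a_list], number=1000)

print(f"NumPy: {numpy_seconds * 1000:.2f} ms")
print(f"list:  {list_seconds * 1000:.2f} ms")

# Both approaches produce the same values.
same_values = (an_array * 2).tolist() == [x * 2 for x in a_list]
print(same_values)
```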


In [117]:
a_nump = np.array([[1,2,3],[4,5,6]])
a_nump

array([[1, 2, 3],
       [4, 5, 6]])

In [118]:
a_nother = np.array([['one','two','three'],['four','five','six']])
a_nother

array([['one', 'two', 'three'],
       ['four', 'five', 'six']], dtype='<U5')

## NumPy descriptives


- NumPy arrays can have any number of dimensions; a 2-dimensional array is like a list of lists (just as we can have a dictionary of dictionaries).

- NumPy has a lot of functionality.
- Here are some attributes of a NumPy array (called `npdata` here) that describe your data:

    - **npdata.ndim**: the number of axes (dimensions) of the array.

    - **npdata.shape**: the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim.

    - **npdata.size**: the total number of elements of the array. This is equal to the product of the elements of shape.

    - **npdata.dtype**: an object describing the type of the elements in the array. One can create or specify dtypes using standard Python types. Additionally, NumPy provides types of its own; numpy.int32, numpy.int16, and numpy.float64 are some examples.
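
A minimal sketch of the four descriptive attributes listed above, using a small 2x3 array. The name `npdata` follows the list; the values are made up for illustration.

```python
import numpy as np

npdata = np.array([[1, 2, 3], [4, 5, 6]])

print(npdata.ndim)   # number of axes
print(npdata.shape)  # (rows, columns)
print(npdata.size)   # total number of elements, the product of the shape
print(npdata.dtype)  # element type; the default integer type depends on the platform
```

Note that none of these take parentheses: they are attributes, not methods.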



## NumPy attributes and methods

- Here is a list of useful functions:


| Operation | Function call     | Description               |
|-----------|-------------------|---------------------------|
| +         | np.add            | Addition                  |
| -         | np.subtract       | Subtraction               |
| -         | np.negative       | Negation                  |
| *         | np.multiply       | Multiplication            |
| /         | np.divide         | Division                  |
| //        | np.floor_divide   | Floor division            |
| `**`      | np.power          | Exponentiation            |
| %         | np.mod            | Modulus/Remainder         |
| [0-x]     | np.arange         | Create a range of numbers |
| \|x\|     | np.absolute       | Absolute value            |
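
As a quick check of the table, the sketch below compares each operator with the corresponding function called through the `np` module (for example `np.add`). The dictionary name `checks` is just a helper for this example.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Each entry compares an operator with its function-call form.
checks = {
    "add": np.array_equal(arr + 1, np.add(arr, 1)),
    "subtract": np.array_equal(arr - 1, np.subtract(arr, 1)),
    "negative": np.array_equal(-arr, np.negative(arr)),
    "multiply": np.array_equal(arr * 2, np.multiply(arr, 2)),
    "divide": np.array_equal(arr / 2, np.divide(arr, 2)),
    "floor_divide": np.array_equal(arr // 2, np.floor_divide(arr, 2)),
    "power": np.array_equal(arr ** 2, np.power(arr, 2)),
    "mod": np.array_equal(arr % 2, np.mod(arr, 2)),
    "absolute": np.array_equal(abs(-arr), np.absolute(-arr)),
}

for name, matches in checks.items():
    print(name, matches)
```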

In [119]:
#np.<TAB>

In [120]:
import numpy as np
arr = np.array([[ 1, 2, 3], [4, 5, 6]]) # this is a 2x3
arr

array([[1, 2, 3],
       [4, 5, 6]])

In [121]:
arr.shape

(2, 3)

In [122]:
# NumPy add

1 + arr

array([[2, 3, 4],
       [5, 6, 7]])

In [123]:
arr

array([[1, 2, 3],
       [4, 5, 6]])

In [124]:
# NumPy subtract

3 - arr

array([[ 2,  1,  0],
       [-1, -2, -3]])

In [125]:
# NumPy Negate

- arr

array([[-1, -2, -3],
       [-4, -5, -6]])

In [126]:
# NumPy Multiply

20* arr

array([[ 20,  40,  60],
       [ 80, 100, 120]])

In [127]:
# NumPy Divide

2/arr

array([[2.        , 1.        , 0.66666667],
       [0.5       , 0.4       , 0.33333333]])

## More NumPy features

- There are lots of convenient features of NumPy:

    - Compare arrays.
       
    - Indexing.

    - Built-in mean function.

    - Random number generator.

    - Sort function 

    - Read/write NumPy files.

    - Linear algebra.
- Like Python types, NumPy comes with a wide variety of attributes.

    - You can view these by typing `np.<TAB>`.

In [128]:
#make a new array
arr2 = np.array([[ 0, 4, 1], [7, 2, 12]])
arr2

array([[ 0,  4,  1],
       [ 7,  2, 12]])

In [129]:
arr

array([[1, 2, 3],
       [4, 5, 6]])

In [130]:
arr2

array([[ 0,  4,  1],
       [ 7,  2, 12]])

In [131]:
# sum two NumPy arrays with the function call

np.add(arr,arr2)

array([[ 1,  6,  4],
       [11,  7, 18]])

In [132]:
# add two NumPy arrays with the operator
# the addition is element-wise
arr+arr2

array([[ 1,  6,  4],
       [11,  7, 18]])

## NumPy indexing

In [133]:
arr

array([[1, 2, 3],
       [4, 5, 6]])

In [134]:
#index array

arr[0]

array([1, 2, 3])

In [135]:
#index array

arr[0][0]

1

In [136]:
# a 3-dimensional array

threeD = np.random.randint(10, size=(3, 4, 5))  #3x4x5

threeD


array([[[4, 5, 8, 4, 4],
        [9, 9, 7, 6, 8],
        [7, 6, 9, 1, 5],
        [5, 9, 9, 7, 2]],

       [[6, 6, 9, 5, 6],
        [8, 1, 8, 4, 2],
        [0, 4, 1, 3, 7],
        [4, 2, 7, 8, 1]],

       [[4, 2, 2, 5, 5],
        [8, 2, 8, 3, 0],
        [5, 0, 6, 8, 6],
        [0, 7, 1, 5, 9]]])

In [137]:
threeD.ndim

3

In [138]:
# indexing a 3d array
threeD[0]

array([[4, 5, 8, 4, 4],
       [9, 9, 7, 6, 8],
       [7, 6, 9, 1, 5],
       [5, 9, 9, 7, 2]])

In [139]:
threeD[0][0]

array([4, 5, 8, 4, 4])

In [140]:
threeD[0][0][0]

4
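
Chained brackets like `arr[0][0]` work, but NumPy also accepts a single comma-separated index, one entry per axis, and slices can be used on any axis. A small sketch, reusing the 2x3 `arr` from earlier:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr[0][0])   # chained brackets, as in the cells above
print(arr[0, 0])   # the same element with one comma-separated index
print(arr[:, 1])   # every row, second column
print(arr[1, 1:])  # second row, from the second column onward
```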

## More NumPy features

In [141]:
print(arr)
print(arr2)

[[1 2 3]
 [4 5 6]]
[[ 0  4  1]
 [ 7  2 12]]


In [142]:
# compare numpy arrays

arr > arr2

array([[ True, False,  True],
       [False,  True, False]])

In [143]:
type(threeD)

numpy.ndarray

In [144]:
# find the mean

threeD[0].mean()


6.2

In [145]:
# the mean for all levels

threeD.mean()

5.033333333333333

In [146]:
# create random numbers

arr3 = np.random.randint(low=1, high=100, size=4) # you can also choose the distribution of random numbers np.random.normal(size=x)
print(arr3)

[87 22 72 50]


In [147]:
# built-in sort method that sorts the array in place

arr3.sort()
print(arr3)

[22 50 72 87]
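
If you want to keep the original order, `np.sort` returns a sorted copy instead of changing the array. A short sketch of the difference; the array values are copied from the output above, and the names `values`, `sorted_copy`, and `order_after_np_sort` are made up for this example.

```python
import numpy as np

values = np.array([87, 22, 72, 50])

sorted_copy = np.sort(values)            # returns a new, sorted array
order_after_np_sort = values.tolist()    # the original order is unchanged
print(sorted_copy)
print(order_after_np_sort)

values.sort()                            # sorts in place and returns None
print(values)
```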


In [148]:
# Save a NumPy object
np.save('some_array', arr)


In [149]:
# Load a NumPy object

np.load('data/dem_load.npy') # forward slashes avoid backslash escape problems in the path

array([['this', 'is', 'how'],
       ['to', 'load', 'numpys']], dtype='<U6')
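
The save and load cells above use two different files. As a sketch of a full round trip with one array, the example below writes to a temporary folder (an assumption made so the example does not touch your working directory) and reads the file back. Note that `np.save` adds the `.npy` extension when the file name does not already have it.

```python
import os
import tempfile

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Write to a temporary folder that is deleted when the block ends.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "some_array")
    np.save(path, arr)                  # saved as some_array.npy
    saved_files = os.listdir(folder)
    loaded = np.load(path + ".npy")

print(saved_files)
print(np.array_equal(arr, loaded))
```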

In [150]:
# linear algebra
x = np.array([[ 1, 2, 3], [4, 5, 6]])
y = np.array([[ 6, 23], [-1, 7], [8, 9]])

x.dot(y)



array([[ 28,  64],
       [ 67, 181]])
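
For 2-dimensional arrays, the `@` operator gives the same matrix product as `.dot`. The sketch below repeats the multiplication above and shows the shape rule: the inner dimensions must match, and the result takes the outer ones. The name `product` is made up for this example.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
y = np.array([[6, 23], [-1, 7], [8, 9]])    # shape (3, 2)

product = x @ y                             # matrix product of x and y
print(product)
print(product.shape)                        # (2, 3) times (3, 2) gives (2, 2)
print(np.array_equal(product, x.dot(y)))
```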

## Pandas dataframes

- NumPy arrays are good to know, but the most useful package for data scientists will be pandas.
- If you had only one column of data, it would be known as a pandas Series.
- However, most often we have a dataframe with multiple columns.

### Creating Data

In [151]:
# a series
import pandas as pd

a_series = pd.Series([ 4, 7, -5, 3])
print(a_series)


0    4
1    7
2   -5
3    3
dtype: int64


In [152]:
# a list of lists
names = [
    ['Dominique','Lockett','M'],
    ['Jordan','Mroz','M'],
    ['Jesse','Woollems','J']
    ]

print(names)

[['Dominique', 'Lockett', 'M'], ['Jordan', 'Mroz', 'M'], ['Jesse', 'Woollems', 'J']]


In [153]:
# turn it into a pandas DataFrame
pd.DataFrame(names)

Unnamed: 0,0,1,2
0,Dominique,Lockett,M
1,Jordan,Mroz,M
2,Jesse,Woollems,J


In [154]:
# add column names and assign the data frame to a variable

data = pd.DataFrame(names, columns=['First','Last','Middle'])
data

Unnamed: 0,First,Last,Middle
0,Dominique,Lockett,M
1,Jordan,Mroz,M
2,Jesse,Woollems,J


In [155]:
data.size # no parentheses here

9

In [156]:
# another dataframe

# When you build a DataFrame from a dictionary, pandas automatically uses the keys as column names
allergies = {
    'Jasmihn' : {'bananas': ['itching','angioedema'], 'peanuts': ['anaphylaxis','angioedema','hives']},
    'Joe':{'pollen': ['itching','sneezing'], 'milk':['hives','vomiting','indigestion']},
    'Sally':{'soy': ['stomach cramps'], 'shellfish':['anaphylaxis','angioedema','hives']}}

pnda_all = pd.DataFrame(allergies) # we can convert our dictionaries into neat pandas dataframes
print(pnda_all)

                                    Jasmihn                             Joe  \
bananas               [itching, angioedema]                             NaN   
peanuts    [anaphylaxis, angioedema, hives]                             NaN   
pollen                                  NaN             [itching, sneezing]   
milk                                    NaN  [hives, vomiting, indigestion]   
soy                                     NaN                             NaN   
shellfish                               NaN                             NaN   

                                      Sally  
bananas                                 NaN  
peanuts                                 NaN  
pollen                                  NaN  
milk                                    NaN  
soy                        [stomach cramps]  
shellfish  [anaphylaxis, angioedema, hives]  


### Basic Functions

In [157]:
# Concatenate
my_data1 = pd.DataFrame({'key1': ['green', 'green', 'red'], 'key2': [' one', 'two', 'one'], 'data1': [9, 10, 11]})

my_data2 = pd.DataFrame({'key1': ['green', 'green', 'red', 'red'],  'key2': [' one', 'one', 'one', 'two'], 'data1': [14, 25, 16, 17]})

my_data3 = pd.DataFrame({'key1': ['red', 'green', 'red', 'red'],  'key2': [' two', 'one', 'one', 'two'], 'data1': [24, 53, 26, 73], 'data2': [42, 52, 62, 27]})



In [158]:
print(my_data1)

print(my_data3)

    key1  key2  data1
0  green   one      9
1  green   two     10
2    red   one     11
    key1  key2  data1  data2
0    red   two     24     42
1  green   one     53     52
2    red   one     26     62
3    red   two     73     27


In [159]:
pd.concat([my_data1, my_data2, my_data3]) # notice that the index values repeat, which is not ideal

Unnamed: 0,key1,key2,data1,data2
0,green,one,9,
1,green,two,10,
2,red,one,11,
0,green,one,14,
1,green,one,25,
2,red,one,16,
3,red,two,17,
0,red,two,24,42.0
1,green,one,53,52.0
2,red,one,26,62.0


In [160]:
pd.concat([my_data1, my_data2, my_data3], ignore_index = True) 

Unnamed: 0,key1,key2,data1,data2
0,green,one,9,
1,green,two,10,
2,red,one,11,
3,green,one,14,
4,green,one,25,
5,red,one,16,
6,red,two,17,
7,red,two,24,42.0
8,green,one,53,52.0
9,red,one,26,62.0


In [161]:
# create random numbers and view the first few values

long_series = pd.Series(np.random.randn(100)) # randn draws from the standard normal distribution
long_series.head()

0   -0.809100
1   -0.675608
2   -0.834457
3    0.598040
4    1.363136
dtype: float64

In [162]:

long_series.tail()

95   -0.341264
96    1.407253
97   -1.499638
98   -0.972503
99    0.401955
dtype: float64

In [163]:
# more attributes can be observed using the normal <TAB> exploration
#long_series.
long_series.abs()

0     0.809100
1     0.675608
2     0.834457
3     0.598040
4     1.363136
        ...   
95    0.341264
96    1.407253
97    1.499638
98    0.972503
99    0.401955
Length: 100, dtype: float64

In [164]:
list('ABCD')

['A', 'B', 'C', 'D']

In [165]:
# more on making your own pandas object

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')) # np.random.randint(low, high, size=(number of rows, number of columns)); high is excluded
df

Unnamed: 0,A,B,C,D
0,5,42,51,42
1,86,53,81,75
2,82,66,71,18
3,44,38,32,0
4,91,87,58,13
...,...,...,...,...
95,31,57,44,69
96,15,70,26,76
97,99,50,68,90
98,78,14,55,47


In [166]:
#Transpose your data
pnda_all.T

# the construction of our dictionary wasn't the most reasonable way to make the desired dataframe

Unnamed: 0,bananas,peanuts,pollen,milk,soy,shellfish
Jasmihn,"[itching, angioedema]","[anaphylaxis, angioedema, hives]",,,,
Joe,,,"[itching, sneezing]","[hives, vomiting, indigestion]",,
Sally,,,,,[stomach cramps],"[anaphylaxis, angioedema, hives]"


### Sort data 

In [167]:
# sort 

# We can change the order of the columns

df.sort_index(axis=1, ascending =False)

Unnamed: 0,D,C,B,A
0,42,51,42,5
1,75,81,53,86
2,18,71,66,82
3,0,32,38,44
4,13,58,87,91
...,...,...,...,...
95,69,44,57,31
96,76,26,70,15
97,90,68,50,99
98,47,55,14,78


In [168]:
# or the rows
df.sort_index(axis=0, ascending = False)

Unnamed: 0,A,B,C,D
99,31,64,29,84
98,78,14,55,47
97,99,50,68,90
96,15,70,26,76
95,31,57,44,69
...,...,...,...,...
4,91,87,58,13
3,44,38,32,0
2,82,66,71,18
1,86,53,81,75


In [169]:
# we can sort by a certain column; notice that whole rows move, so the order changes in every column
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
66,61,3,41,82
56,5,5,21,60
34,41,7,60,91
12,30,8,77,26
39,30,9,14,6
...,...,...,...,...
53,51,91,72,47
37,49,92,31,53
57,79,92,8,33
61,88,93,25,45


If all of this grouping is confusing with just numbers, look at [this example](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/) that I drew from, which uses basketball teams to explain.
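
Grouping itself is not shown in the cells above, so here is a minimal sketch of `groupby` in the spirit of the linked example. The team names and point totals are hypothetical, made up only for illustration.

```python
import pandas as pd

# Hypothetical data: one row per player.
players = pd.DataFrame({
    "team": ["Hawks", "Hawks", "Bulls", "Bulls", "Bulls"],
    "points": [10, 30, 5, 15, 25],
})

# Split the rows by team, then take the mean of points within each team.
mean_points = players.groupby("team")["points"].mean()
print(mean_points)
```

The result is a Series indexed by team name, so `mean_points["Hawks"]` returns the average for that team.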

### Other functions

In [170]:
# copy
df2 = df.copy() # make a new copy that doesn't impact the original
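
To see why `.copy()` matters, the sketch below compares it with plain assignment, which only creates a second name for the same dataframe. The names `original`, `alias`, and `independent` are made up for this example.

```python
import pandas as pd

original = pd.DataFrame({"A": [1, 2, 3]})

alias = original                  # a second name for the same object
independent = original.copy()     # a separate object with the same data

independent["B"] = [10, 20, 30]   # only the copy gains column B
alias["C"] = [7, 8, 9]            # this also adds column C to `original`

print(list(original.columns))
print(list(independent.columns))
```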

In [171]:
# new columns
New = [1]*33 + [2]*33 + [3]*33 + [4]
df2['NewCol'] = New

In [172]:
df2


Unnamed: 0,A,B,C,D,NewCol
0,5,42,51,42,1
1,86,53,81,75,1
2,82,66,71,18,1
3,44,38,32,0,1
4,91,87,58,13,1
...,...,...,...,...,...
95,31,57,44,69,3
96,15,70,26,76,3
97,99,50,68,90,3
98,78,14,55,47,3


In [173]:
df

Unnamed: 0,A,B,C,D
0,5,42,51,42
1,86,53,81,75
2,82,66,71,18
3,44,38,32,0
4,91,87,58,13
...,...,...,...,...
95,31,57,44,69
96,15,70,26,76
97,99,50,68,90
98,78,14,55,47


In [174]:
df2 = df2.replace(42,0.00)
df2

Unnamed: 0,A,B,C,D,NewCol
0,5,42,51,42,1
1,86,53,81,75,1
2,82,66,71,18,1
3,44,38,32,0,1
4,91,87,58,13,1
...,...,...,...,...,...
95,31,57,44,69,3
96,15,0,26,76,3
97,99,50,68,90,3
98,78,14,55,47,3


In [175]:
# where keeps the values that meet a condition and replaces the rest with NaN
df2 = df2.where(df2<70)
df2

Unnamed: 0,A,B,C,D,NewCol
0,5.0,42.0,51.0,42.0,1
1,,53.0,,,1
2,,66.0,,18.0,1
3,44.0,38.0,32.0,0.0,1
4,,,58.0,13.0,1
...,...,...,...,...,...
95,31.0,57.0,44.0,69.0,3
96,15.0,0.0,26.0,,3
97,,50.0,68.0,,3
98,,14.0,55.0,47.0,3
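
By default `where` fills the failing cells with NaN, which is also why the integer columns turn into floats above. A small sketch, with made-up values, of the default behavior next to the `other` argument, which supplies a replacement value instead:

```python
import pandas as pd

scores = pd.DataFrame({"A": [5, 86, 44], "B": [42, 53, 90]})

masked = scores.where(scores < 70)            # failing values become NaN
filled = scores.where(scores < 70, other=0)   # failing values become 0 instead

print(masked)
print(filled)
```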


In [176]:
# dropna removes every row that contains a missing value
df2.dropna()

Unnamed: 0,A,B,C,D,NewCol
0,5.0,42.0,51.0,42.0,1
3,44.0,38.0,32.0,0.0,1
8,35.0,31.0,16.0,12.0,1
11,26.0,30.0,23.0,37.0,1
13,0.0,69.0,15.0,60.0,1
14,10.0,20.0,46.0,38.0,1
21,43.0,45.0,12.0,60.0,1
24,68.0,12.0,1.0,35.0,1
28,41.0,45.0,50.0,18.0,1
32,52.0,23.0,54.0,51.0,1


In [177]:
# or we can drop only the rows that have a missing value in a certain column

df2.dropna(subset=["A"])

Unnamed: 0,A,B,C,D,NewCol
0,5.0,42.0,51.0,42.0,1
3,44.0,38.0,32.0,0.0,1
6,8.0,19.0,15.0,,1
8,35.0,31.0,16.0,12.0,1
9,44.0,,53.0,5.0,1
...,...,...,...,...,...
92,0.0,,34.0,53.0,3
93,37.0,,41.0,42.0,3
95,31.0,57.0,44.0,69.0,3
96,15.0,0.0,26.0,,3


### Indexing

In [178]:
df

Unnamed: 0,A,B,C,D
0,5,42,51,42
1,86,53,81,75
2,82,66,71,18
3,44,38,32,0
4,91,87,58,13
...,...,...,...,...
95,31,57,44,69
96,15,70,26,76
97,99,50,68,90
98,78,14,55,47


In [179]:
data

Unnamed: 0,First,Last,Middle
0,Dominique,Lockett,M
1,Jordan,Mroz,M
2,Jesse,Woollems,J


In [180]:
# use indexing like we have learned
data['First']

0    Dominique
1       Jordan
2        Jesse
Name: First, dtype: object

In [181]:
pnda_all

Unnamed: 0,Jasmihn,Joe,Sally
bananas,"[itching, angioedema]",,
peanuts,"[anaphylaxis, angioedema, hives]",,
pollen,,"[itching, sneezing]",
milk,,"[hives, vomiting, indigestion]",
soy,,,[stomach cramps]
shellfish,,,"[anaphylaxis, angioedema, hives]"


In [182]:
pnda_all['Joe']

bananas                                 NaN
peanuts                                 NaN
pollen                  [itching, sneezing]
milk         [hives, vomiting, indigestion]
soy                                     NaN
shellfish                               NaN
Name: Joe, dtype: object

### .loc attribute

In [183]:
data[0:3] # we can call on a range of rows

Unnamed: 0,First,Last,Middle
0,Dominique,Lockett,M
1,Jordan,Mroz,M
2,Jesse,Woollems,J


In [184]:

pnda_all['soy':'shellfish'] # call on a range of rows



Unnamed: 0,Jasmihn,Joe,Sally
soy,,,[stomach cramps]
shellfish,,,"[anaphylaxis, angioedema, hives]"


In [185]:
#data[1] # but we cannot call on a single row

In [186]:
#pnda_all['soy'] # cannot just call on a single row

In [187]:
data.loc[0] # unless we add .loc

First     Dominique
Last        Lockett
Middle            M
Name: 0, dtype: object

In [188]:
pnda_all.loc['soy']

Jasmihn                 NaN
Joe                     NaN
Sally      [stomach cramps]
Name: soy, dtype: object

In [189]:
# More indexing

df[0:3]


Unnamed: 0,A,B,C,D
0,5,42,51,42
1,86,53,81,75
2,82,66,71,18


In [190]:
# cannot double index without loc
#df[:50,"A"]

In [191]:
#df[2,"C"]

In [192]:
df.loc[:50,"A"]

0      5
1     86
2     82
3     44
4     91
5     77
6      8
7     78
8     35
9     44
10    30
11    26
12    30
13    70
14    10
15    19
16    70
17    18
18    43
19    90
20    62
21    43
22    91
23    51
24    68
25     7
26    80
27    88
28    41
29    66
30    50
31    59
32    52
33    94
34    41
35     6
36    36
37    49
38    69
39    30
40    88
41    52
42    95
43    62
44    86
45    50
46     8
47    24
48    44
49    39
50    60
Name: A, dtype: int32

In [193]:
#df[:50,["A"]]

In [194]:
df.loc[:50, ['A','D']]

Unnamed: 0,A,D
0,5,42
1,86,75
2,82,18
3,44,0
4,91,13
5,77,36
6,8,75
7,78,9
8,35,12
9,44,5


df.loc[2:3, ['C',"D"]]

In [195]:
df.loc[[2,13,12],'C']

2     71
13    15
12    77
Name: C, dtype: int32

In [196]:
df.loc[2,'C']

71
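
Notice above that `df.loc[:50, "A"]` returned 51 rows: `.loc` selects by label, and a label slice includes its end point. pandas also provides `.iloc`, which is not used in this lecture, for selecting by position, where the end of a slice is excluded as in ordinary Python. A sketch with made-up values, and an index chosen so that labels and positions differ:

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [5, 86, 82, 44], "C": [51, 81, 71, 32]},
    index=[10, 11, 12, 13],         # labels that differ from positions 0 to 3
)

by_label = frame.loc[10:12, "A"]    # label slice: includes the end label 12
by_position = frame.iloc[0:2, 0]    # position slice: excludes position 2

print(by_label.tolist())
print(by_position.tolist())
```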

### Attributes of columns and rows (index)

In [197]:
# Calling columns and rows.
# Let's say we want to change a column name. There are a couple of ways

df.A

0      5
1     86
2     82
3     44
4     91
      ..
95    31
96    15
97    99
98    78
99    31
Name: A, Length: 100, dtype: int32

In [198]:
# columns have their own set of attributes

#df.A.<TAB>

df.A.isin(range(0,50)) # we can see if the values of a column are between 0 and 49

0      True
1     False
2     False
3      True
4     False
      ...  
95     True
96     True
97    False
98    False
99     True
Name: A, Length: 100, dtype: bool

In [199]:
# we can call on the columns of a dataset and change all their names

df.columns = ['This one', 'That one', 'And','The other']
df

Unnamed: 0,This one,That one,And,The other
0,5,42,51,42
1,86,53,81,75
2,82,66,71,18
3,44,38,32,0
4,91,87,58,13
...,...,...,...,...
95,31,57,44,69
96,15,70,26,76
97,99,50,68,90
98,78,14,55,47


In [200]:
# but now there are spaces in the column names so we can't call on them with a period

#df.This one

In [201]:
# instead we will have to go back to brackets

df['This one']

0      5
1     86
2     82
3     44
4     91
      ..
95    31
96    15
97    99
98    78
99    31
Name: This one, Length: 100, dtype: int32

In [202]:
# and we can still use the column attributes
df['This one'].isin(range(20,30))

0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Name: This one, Length: 100, dtype: bool

In [203]:
# change them all back
df.columns = ['A', 'B', 'C','D']


In [204]:
# we can also change individual columns and rows (indexes)
df.rename(columns={'A': 'a'}, index={1:'one'})

Unnamed: 0,a,B,C,D
0,5,42,51,42
one,86,53,81,75
2,82,66,71,18
3,44,38,32,0
4,91,87,58,13
...,...,...,...,...
95,31,57,44,69
96,15,70,26,76
97,99,50,68,90
98,78,14,55,47


In [205]:
# or we can change multiple

# BUT REMEMBER: if you are not assigning this to a new variable the changes will not be saved
#notice how 1 is back to a number not the word

df.rename(columns={'B': 'b', 'C':'c'}, index={2:'two', 3:'three'})

Unnamed: 0,A,b,c,D
0,5,42,51,42
1,86,53,81,75
two,82,66,71,18
three,44,38,32,0
4,91,87,58,13
...,...,...,...,...
95,31,57,44,69
96,15,70,26,76
97,99,50,68,90
98,78,14,55,47


### Subsetting

In [206]:
# Boolean

df[df>20] # returns the normal dataframe but puts NaN where the condition is not true

Unnamed: 0,A,B,C,D
0,,42.0,51.0,42.0
1,86.0,53.0,81.0,75.0
2,82.0,66.0,71.0,
3,44.0,38.0,32.0,
4,91.0,87.0,58.0,
...,...,...,...,...
95,31.0,57.0,44.0,69.0
96,,70.0,26.0,76.0
97,99.0,50.0,68.0,90.0
98,78.0,,55.0,47.0


In [207]:
# remove rows with missing values

df2 = df[df>20].dropna()

df2 # you have to assign this to a new variable to keep it

Unnamed: 0,A,B,C,D
1,86.0,53.0,81.0,75.0
5,77.0,44.0,81.0,36.0
11,26.0,30.0,23.0,37.0
18,43.0,86.0,55.0,31.0
19,90.0,58.0,71.0,72.0
22,91.0,71.0,58.0,26.0
23,51.0,50.0,99.0,71.0
26,80.0,27.0,34.0,94.0
29,66.0,78.0,84.0,77.0
30,50.0,61.0,48.0,71.0


In [208]:
df2

Unnamed: 0,A,B,C,D
1,86.0,53.0,81.0,75.0
5,77.0,44.0,81.0,36.0
11,26.0,30.0,23.0,37.0
18,43.0,86.0,55.0,31.0
19,90.0,58.0,71.0,72.0
22,91.0,71.0,58.0,26.0
23,51.0,50.0,99.0,71.0
26,80.0,27.0,34.0,94.0
29,66.0,78.0,84.0,77.0
30,50.0,61.0,48.0,71.0


In [209]:
# Keep only the rows where column A is greater than 50

df2 = df2[df2.A>50]
df2

Unnamed: 0,A,B,C,D
1,86.0,53.0,81.0,75.0
5,77.0,44.0,81.0,36.0
19,90.0,58.0,71.0,72.0
22,91.0,71.0,58.0,26.0
23,51.0,50.0,99.0,71.0
26,80.0,27.0,34.0,94.0
29,66.0,78.0,84.0,77.0
31,59.0,45.0,24.0,80.0
32,52.0,23.0,54.0,51.0
33,94.0,71.0,82.0,29.0


In [210]:
df

Unnamed: 0,A,B,C,D
0,5,42,51,42
1,86,53,81,75
2,82,66,71,18
3,44,38,32,0
4,91,87,58,13
...,...,...,...,...
95,31,57,44,69
96,15,70,26,76
97,99,50,68,90
98,78,14,55,47


In [211]:
# Keep only the rows where column B is less than 50

df[df.B < 50]

Unnamed: 0,A,B,C,D
0,5,42,51,42
3,44,38,32,0
5,77,44,81,36
6,8,19,15,75
7,78,9,22,9
8,35,31,16,12
11,26,30,23,37
12,30,8,77,26
14,10,20,46,38
17,18,32,90,34


In [212]:
df.A.where(df.A >30)

0      NaN
1     86.0
2     82.0
3     44.0
4     91.0
      ... 
95    31.0
96     NaN
97    99.0
98    78.0
99    31.0
Name: A, Length: 100, dtype: float64

In [213]:
new_df = df.copy().where(df.A >30).dropna()

In [214]:
new_df

Unnamed: 0,A,B,C,D
1,86.0,53.0,81.0,75.0
2,82.0,66.0,71.0,18.0
3,44.0,38.0,32.0,0.0
4,91.0,87.0,58.0,13.0
5,77.0,44.0,81.0,36.0
...,...,...,...,...
94,74.0,77.0,28.0,70.0
95,31.0,57.0,44.0,69.0
97,99.0,50.0,68.0,90.0
98,78.0,14.0,55.0,47.0
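
Conditions can also be combined. In pandas each condition goes in its own parentheses, joined with `&` for "and" or `|` for "or"; the Python keywords `and` and `or` do not work on whole columns. A sketch with made-up values:

```python
import pandas as pd

frame = pd.DataFrame({"A": [5, 86, 44, 91], "B": [42, 53, 38, 87]})

# Rows where both conditions hold.
both = frame[(frame.A > 30) & (frame.B < 60)]

# Rows where at least one condition holds.
either = frame[(frame.A > 90) | (frame.B < 40)]

print(both)
print(either)
```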


In [215]:
# more cell wise selection

df[df.A != 61][0:10] # keep only the rows where column A is NOT 61, then show the first 10

Unnamed: 0,A,B,C,D
0,5,42,51,42
1,86,53,81,75
2,82,66,71,18
3,44,38,32,0
4,91,87,58,13
5,77,44,81,36
6,8,19,15,75
7,78,9,22,9
8,35,31,16,12
9,44,74,53,5


In [216]:
df

Unnamed: 0,A,B,C,D
0,5,42,51,42
1,86,53,81,75
2,82,66,71,18
3,44,38,32,0
4,91,87,58,13
...,...,...,...,...
95,31,57,44,69
96,15,70,26,76
97,99,50,68,90
98,78,14,55,47


In [217]:
# Combining NumPy and pandas

df['E'] = np.where(df['A']>=50, 'yes', 'no') # create a new column E: 'yes' where column A is greater than or equal to 50, otherwise 'no'
df

Unnamed: 0,A,B,C,D,E
0,5,42,51,42,no
1,86,53,81,75,yes
2,82,66,71,18,yes
3,44,38,32,0,no
4,91,87,58,13,yes
...,...,...,...,...,...
95,31,57,44,69,no
96,15,70,26,76,no
97,99,50,68,90,yes
98,78,14,55,47,yes


### Create a new file

In [218]:
df.to_csv('File.csv')
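
By default `to_csv` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference that `index=False` makes. It writes to in-memory text buffers instead of files, an assumption made only to keep the example self-contained, and the frame values are made up.

```python
import io

import pandas as pd

frame = pd.DataFrame({"A": [5, 86], "B": [42, 53]})

with_index = io.StringIO()
frame.to_csv(with_index)                    # default: index written as the first column
without_index = io.StringIO()
frame.to_csv(without_index, index=False)    # leave the index out

# Rewind each buffer and read it back as if it were a file.
with_index.seek(0)
without_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```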