# Name Search

This notebook shows how we can use `bash` instead of `python` for specific tasks.

We have created some files filled with random names, and two of them contain the name of my friend Julian Triana. One of these two files is a copy of another file in which just one name was replaced by Julian's.

The task here is to find where Julian is and, of course, to show which file was copied into which.

---
We are going to use the `os` library to access operating system services.

In [1]:
import os

All our data is in one folder, which also contains a stray file (`test.dat`) and a stray folder (`test/`).

In [2]:
ls names_files/

0.dat     20.dat    32.dat    44.dat    56.dat    68.dat    8.dat     91.dat
1.dat     21.dat    33.dat    45.dat    57.dat    69.dat    80.dat    92.dat
10.dat    22.dat    34.dat    46.dat    58.dat    7.dat     81.dat    93.dat
11.dat    23.dat    35.dat    47.dat    59.dat    70.dat    82.dat    94.dat
12.dat    24.dat    36.dat    48.dat    6.dat     71.dat    83.dat    95.dat
13.dat    25.dat    37.dat    49.dat    60.dat    72.dat    84.dat    96.dat
14.dat    26.dat    38.dat    5.dat     61.dat    73.dat    85.dat    97.dat
15.dat    27.dat    39.dat    50.dat    62.dat    74.dat    86.dat    98.dat
16.dat    28.dat    4.dat     51.dat    63.dat    75.dat    87.dat    99.dat
17.dat    29.dat    40.dat    52.dat    64.dat    76.dat    88.dat    test/
18.dat    3.dat     41.dat    53.dat    65.dat    77.dat    89.dat    test.dat
19.dat    30.dat    42.dat    54.dat    66.dat    78.dat    9.dat
2.dat     31.dat    43.dat    55.dat    67.dat    79.dat    90.

Here is an example of how to ignore some of the values returned by a function, using `_` as a throwaway name:

In [3]:
def f(x):
    return x, 2*x, x**3

# `_` is a conventional throwaway name: keep only the third value
_, _, a = f(2)
print(a)

8


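`os.walk` yields a `(dirpath, dirnames, filenames)` tuple for each folder it visits, starting with the top one. We ignore the current folder path and the subfolder names inside `names_files`, keep only the file names, and stop after the first iteration: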

In [4]:
for _,_,file_names in os.walk("names_files"):
    break
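To see what `os.walk` yields on its first iteration, here is a minimal sketch that builds a throwaway folder (the file and folder names are made up for illustration) and unpacks the first tuple with `next` rather than a `for`/`break` loop. Both approaches should give the same result.

```python
import os
import tempfile

# Build a small, hypothetical tree: two data files, one stray file, one stray folder.
with tempfile.TemporaryDirectory() as root:
    for name in ("0.dat", "1.dat", "test.dat"):
        open(os.path.join(root, name), "w").close()
    os.mkdir(os.path.join(root, "test"))

    # os.walk yields (dirpath, dirnames, filenames), starting with the top folder.
    dirpath, dirnames, filenames = next(os.walk(root))

    # The order returned by os.walk is not guaranteed, so sort for display.
    top_level_dirs = sorted(dirnames)
    top_level_files = sorted(filenames)

print(top_level_dirs)   # ['test']
print(top_level_files)  # ['0.dat', '1.dat', 'test.dat']
```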

We drop `test.dat` from the list:

In [5]:
names=[]
for name in file_names:
    if not name.startswith('t'):
        names.append(name)

Then we build a list of file names without the `.dat` extension:

In [6]:
file_names = [name[:-4] for name in names]  # strip the 4-character ".dat" suffix

In [7]:
file_names.sort()
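Note that `file_names` holds strings, so `sort()` orders them lexicographically; this is why the DataFrame columns further down read 0, 1, 10, 11, and so on. If numeric order is preferred, one option is to sort with `key=int`. The short list below is only a stand-in for the real one.

```python
# Stand-in for the real list of extension-free file names (strings).
sample_names = ["10", "2", "1", "0", "11"]

lexicographic = sorted(sample_names)      # compares character by character
numeric = sorted(sample_names, key=int)   # compares the integer values

print(lexicographic)  # ['0', '1', '10', '11', '2']
print(numeric)        # ['0', '1', '2', '10', '11']
```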

We read every file and check each one for Julian, printing the file and the line number where he appears:

In [8]:
import pandas as pd

In [9]:
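for file in file_names:
    data = pd.read_csv("names_files/" + file + '.dat', header=None)
    values = data.values.ravel().tolist()  # flatten the single column into a plain list

    if 'Julian Triana' in values:
        # +1 turns the 0-based position into a 1-based line number
        print(file, values.index('Julian Triana') + 1)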


36 93
39 50


One could also load all the data into a single `pd.DataFrame`, with one column per file, if more analysis is needed:

In [10]:
df=pd.DataFrame()
for file in file_names:
    data=pd.read_csv("names_files/"+file+'.dat',header=None)
    df[file]=data.values.reshape(100)

In [11]:
df

Unnamed: 0,0,1,10,11,12,13,14,15,16,17,...,90,91,92,93,94,95,96,97,98,99
0,Joshua Kaufman,Sharon Escobedo,Ethel Woods,Loretta Erickson,Danny Green,Timothy Tripp,Chris Mason,Alejandra Stennett,John Backlund,Devin Jones,...,Geraldine Ferris,Richard Benn,Karyl Hernandez,Shauna Garr,Jimmy Morgan,Stanley Barbour,Joanne Sanchez,Hugh Warner,Marvin Sweitzer,Patricia Alexander
1,Stephen Delacruz,Charles Davis,Erica Demaire,Luis Kottke,Kim Basile,Dottie Wren,Kamilah Nichols,Lynn Thomas,Elva Medovich,Mark Manning,...,Amanda Lavelle,Ronald Rush,Amy Samson,William Maggard,Betty Mohr,Linda Worsham,Geraldine Sanchez,Eileen Valencia,Mary Bintner,James Beard
2,Ramona Dwaileebe,Matthew Fox,Dennis Bland,Jay Omalley,Mildred Nolin,Richard Bates,Erin Schwab,John Strissel,Hanna Walker,Donna Barnard,...,Lidia Connelly,Milton Mcgary,Alexis Eggleston,Willie Dunn,Cherly Bartholf,Sandra Grace,William Patterson,Edwin Strauss,Sharon Hawthorne,Gustavo Otis
3,April Wagner,Christine Webb,Scott Knowles,John Vastardis,Kevin Auston,John Hurd,Jose Kelly,Gertrude Gore,Donna Allen,Ronald Mcvea,...,Ralph Watson,Lana Orbin,Jimmy Rice,Eileen Williams,David Murphy,Connie Gross,Charlie Harian,Arthur Swint,Brenda Ashley,Frances Spencer
4,Vivian Crum,Darrell Cover,Paula Hyde,Robert Meservey,Mary Smothers,Joshua Sponaugle,Gordon Spann,Royce Fielder,Derrick Hansen,Melody Stone,...,Kenneth Cohen,Amy Smith,Christine Smith,Ronald Donnally,Pamela Engleman,Patricia Eilers,Doris Farruggio,Jacob Willis,Shirley Mccarthy,Mary Stoviak
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Jason Spencer,Randy Welch,Paula Tadych,Kimberly Clifton,Ida Rendon,Tracy Davis,David Goss,John Hineline,Vance Swaney,Barbara Hughes,...,Jaime Norris,Bonnie Packer,Jose Smith,Michael Garcia,Betty Cupples,Martha Hinton,Elizabeth Bond,Thomas Trinidad,Marianne Collins,Douglas Walton
96,Eliseo Demby,Allen Vincent,Marcus Evans,George Martin,Sheila Figueroa,Lloyd Vosmus,Greg Williams,Lucia Washington,Christopher Shepard,Matthew Williams,...,Roland Jacob,John Brunson,Patricia Fahie,Evelyn Newton,Ray Halloran,Dollie Mannings,Eric Miller,Travis Cruz,Henry Stephenson,Christopher Wiles
97,Peter Huber,Celia Bell,Terry Rayborn,Robert Platt,Marcella Jiminez,Jimmy Spry,Caroline Smith,Robert Whitted,Joni Olvera,Jeffrey Perretta,...,Carolyn Martell,Hildegard Saraiva,Robert Martin,Kathy Jackson,Robert Moran,Mary Gonzalez,Jay May,Margaret Bailey,James Flores,Stephanie Musser
98,Louis Lawrence,Delilah Reyna,Randal Summers,Carol Brown,Steve Guess,Shauna Harlin,Wendy Dial,Mary Franks,Michael Cushenberry,Jerome Hartman,...,Donna Camp,Joy Gagnon,Steven Mister,Mary Bradley,Bridgett Flannagan,Roy Fitzgibbon,Richard Crump,George Beauford,Joseph Lamb,Edward Robinson


We can then simply look for the places where Julian appears in the DataFrame:

In [12]:
df=='Julian Triana'

Unnamed: 0,0,1,10,11,12,13,14,15,16,17,...,90,91,92,93,94,95,96,97,98,99
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
96,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
97,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
98,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
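A boolean table of this size is hard to read by eye. One possible way to turn it into (file, line) pairs is `numpy.where`, sketched here on a tiny made-up DataFrame; the names and column labels are hypothetical, not taken from the real files.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the DataFrame: one column per file.
small_df = pd.DataFrame({
    "14": ["Ann Lee", "Bob Ray", "Cy Fox"],
    "36": ["Dee Poe", "Julian Triana", "Eve Kim"],
    "39": ["Ann Lee", "Bob Ray", "Julian Triana"],
})

mask = small_df == "Julian Triana"

# np.where returns the row and column positions of every True entry.
rows, cols = np.where(mask.to_numpy())

# Report (file, 1-based line number) pairs.
locations = [(small_df.columns[c], int(r) + 1) for r, c in zip(rows, cols)]
print(locations)  # [('36', 2), ('39', 3)]
```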


But it is much easier to use the `grep` command in the terminal; the `-n` flag prints the line number of each match:

In [13]:
%%bash
grep -n Triana names_files/*.dat

names_files/36.dat:93:Julian Triana
names_files/39.dat:50:Julian Triana
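For comparison, here is a rough pure-Python sketch of what `grep -n` reports. It is only an approximation: it uses a plain substring test rather than a regular expression, and the helper name `grep_n` and the demo files are assumptions made for this example.

```python
import glob
import os
import tempfile


def grep_n(pattern, paths):
    """Return 'path:line_number:line' for every line containing `pattern`."""
    matches = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                if pattern in line:  # plain substring test, not a regex
                    matches.append(f"{path}:{number}:{line.rstrip()}")
    return matches


# Hypothetical demo files standing in for names_files/*.dat
with tempfile.TemporaryDirectory() as root:
    demo = {"a.dat": "Ann Lee\nBob Ray\n", "b.dat": "Cy Fox\nJulian Triana\n"}
    for name, text in demo.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    paths = sorted(glob.glob(os.path.join(root, "*.dat")))
    # Drop the temporary folder prefix so the result is reproducible.
    results = [m[len(root) + 1:] for m in grep_n("Triana", paths)]

print(results)  # ['b.dat:2:Julian Triana']
```

The one-line `grep` call replaces all of this, which is the point of the notebook.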


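Now we can take one of the files we found, here `39.dat`, and compare it with every other file using `diff`. Piping the result through `wc` gives the number of lines, words and characters in each diff: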

In [14]:
%%bash
# Compare every file against 39.dat; a very short diff marks a near-copy
for i in names_files/*.dat
do
    echo "$i" $(diff "$i" names_files/39.dat | wc)
done

names_files/0.dat 202 602 3233
names_files/1.dat 202 602 3207
names_files/10.dat 202 602 3192
names_files/11.dat 202 602 3192
names_files/12.dat 202 602 3216
names_files/13.dat 202 602 3230
names_files/14.dat 4 8 43
names_files/15.dat 202 602 3244
names_files/16.dat 202 602 3233
names_files/17.dat 202 602 3224
names_files/18.dat 202 602 3239
names_files/19.dat 202 602 3226
names_files/2.dat 202 602 3210
names_files/20.dat 202 602 3253
names_files/21.dat 202 602 3219
names_files/22.dat 202 602 3228
names_files/23.dat 202 602 3243
names_files/24.dat 202 602 3238
names_files/25.dat 202 602 3209
names_files/26.dat 202 602 3249
names_files/27.dat 202 602 3234
names_files/28.dat 202 602 3277
names_files/29.dat 202 602 3196
names_files/3.dat 202 602 3218
names_files/30.dat 202 602 3213
names_files/31.dat 202 602 3220
names_files/32.dat 202 602 3211
names_files/33.dat 202 602 3207
names_files/34.dat 202 602 3239
names_files/35.dat 202 602 3229
names_files/36.dat 202 598 3249
names_files/37.dat

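So file 14 was copied into file 39: their diff is only 4 lines long, which corresponds to a single changed line, while the diffs of the other files listed are 202 lines long, meaning that all 100 lines differ.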
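The same idea can be sketched in Python by counting the positions at which two files differ. This is a simplified positional comparison, not the algorithm `diff` actually uses, and the miniature "files" below are hypothetical.

```python
def count_changed_lines(lines_a, lines_b):
    """Count positions at which two equally long lists of lines differ."""
    return sum(a != b for a, b in zip(lines_a, lines_b))


# Hypothetical miniature files, keyed by file name without extension.
files = {
    "14": ["Ann Lee", "Bob Ray", "Cy Fox"],
    "36": ["Dee Poe", "Julian Triana", "Eve Kim"],
    "39": ["Ann Lee", "Bob Ray", "Julian Triana"],
}

target = "39"
changes = {
    name: count_changed_lines(lines, files[target])
    for name, lines in files.items()
    if name != target
}

# The file with the fewest changed lines is the likely original.
closest = min(changes, key=changes.get)

print(changes)  # {'14': 1, '36': 3}
print(closest)  # 14
```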